In brief, the grammar tells us that a graphic is a mapping from data to aesthetic attributes (position, color/fill, shape, size) of geometric objects (lines, points, bars, etc.). The plot may also include statistical transformations of the data and information about the plot’s coordinate system. Faceting can be used to draw the same plot for different subsets of the data. The combination of these independent components is what makes up a graphic.
All plots are composed of the data, the information you want to visualize, and a mapping, the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components:
Every ggplot2 plot has three key components:
In general, there are 3 purposes for a layer:
Each layer can come from a different data set and have a different aesthetic mapping, making it possible to create sophisticated plots.
In ggplot2, a layer can be created using geom_ or stat_. This section also covers position adjustments within a layer.
Note that for stats and geoms, you pass additional parameters in ..., whereas for position adjustments, you supply additional parameters by calling the position adjustment function.
Geoms can be roughly divided into individual and collective geoms.
geom_point() draws one point per row. Each geom can be used for many different purposes, especially if you are creative.
These geoms are the fundamental building blocks of ggplot2. They are useful in their own right, but are also used to construct more complex geoms. Also, most of these geoms are associated with a named plot.
Each of these geoms is two dimensional and requires both x and y aesthetics. All of them understand color and size aesthetics, and the filled ones (bar, tile, polygon) also understand fill.
- geom_area(): draws an area plot, which is a line plot filled to the y-axis. Multiple groups will be stacked on top of each other.
- geom_bar(stat = "identity"): makes a bar chart from precomputed heights. The default, stat = "count", counts the number of observations at each x value instead. Multiple bars in the same location will be stacked on top of each other.
- geom_line(), geom_path(): make a line plot. The group aesthetic determines which observations are connected. geom_line() connects points from left to right; geom_path() is similar but connects points in the order they appear in the data. Both geom_line() and geom_path() also understand the linetype aesthetic, which maps a categorical variable to solid, dotted and dashed lines.
- geom_point(): produces a scatterplot. geom_point() also understands the shape aesthetic.
- geom_polygon(): draws polygons, which are filled paths. Each vertex of the polygon is a separate row in the data. It is often useful to merge a data frame of polygon coordinates with the data prior to plotting.
- geom_rect(), geom_tile(), geom_raster(): draw rectangles. geom_rect() is parameterized by the four corners of the rectangle: xmin, ymin, xmax, and ymax. geom_tile() is exactly the same but parameterized by the center of the rectangle and its size: x, y, width, and height. geom_raster() is a fast special case of geom_tile() used when all the tiles are the same size.

group aesthetic
The group aesthetic controls how observations are assigned to groups, that is, which rows are drawn together as one graphical element (for example, one line or one polygon). By default, the group aesthetic is mapped to the interaction of all discrete variables in the plot. This often partitions the data correctly, for example when you plot a bar chart of score by student name. If all variables in the plot are continuous, the default grouping variable will be a constant (e.g. group = 1).
A few key points about the group aesthetic:
- If the group aesthetic is set in ggplot(), for example ggplot(df, aes(x, y, group = z)), it will be applied to all layers. This can be overridden later by supplying aes(group =) in an individual layer.
- To group by a continuous variable, use aes(group = continuous-var-name).
- To group by a combination of variables, use aes(group = interaction(var-name-1, var-name-2)).

The sections below give a few examples where the default grouping isn’t enough. The examples use Oxboys, a simple longitudinal data set from the nlme package. It records height and centered age, measured on 9 occasions per subject. Subject and Occasion are stored as ordered factors.
data(Oxboys, package = "nlme")
head(Oxboys)
## Subject age height Occasion
## 1 1 -1.0000 140.5 1
## 2 1 -0.7479 143.4 2
## 3 1 -0.4630 144.8 3
## 4 1 -0.1643 147.1 4
## 5 1 -0.0027 147.7 5
## 6 1 0.2466 150.2 6
To create spaghetti plots by subject, the default grouping is not enough:
ggplot(Oxboys, aes(age, height)) + ## Both age and height are continuous variables, default group = 1
geom_line() +
geom_point()
To get the plot we want, group by Subject:
ggplot(Oxboys, aes(age, height, group = Subject)) + ## override default grouping
geom_line() +
geom_point()
This typically happens when we want to display one layer for individual data, and another layer for an overall summary.
For example, suppose we want to draw lines by subject and then fit a single smooth line across all subjects. To avoid getting one smooth line per subject, set the grouping in the line layer only:
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group = Subject)) + ## if grouping specified on ggplot(), it will get applied to all layers
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
The following code will generate the same output by overriding the grouping on ggplot().
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_smooth(aes(group = 1), method = "lm", se = FALSE)
Here is another example of overriding the default grouping. Suppose we want to draw boxplots by occasion, also lines by subjects. Code can be written like this:
ggplot(Oxboys, aes(Occasion, height)) + ## group by the discrete variable Occasion by default
geom_boxplot() + ## create boxplot by Occasion
geom_line(aes(group = Subject), color = "#3366FF", alpha = 0.5) ## overriding default grouping
What happens when a single aesthetic (e.g. color) is mapped to different geometric objects (e.g. lines and points)?
In ggplot2, this is handled differently for different collective geoms. Lines and paths operate on a “first value” principle: each segment is defined by two observations, and ggplot2 applies the aesthetic value (e.g. color) associated with the first observation when drawing the segment.
df <- data.frame(x = 1:3, y = 1:3, color = c(1, 3, 5))
## color is discrete
ggplot(df, aes(x, y, color = factor(color))) + ## color specified on ggplot() applies to all layers
geom_line(aes(group = 1), size = 2) + ## default grouping is the color variable - discrete due to factor()
geom_point(size = 5) ## aes(group = 1) would have no effect here: points are individual geoms, drawn one per row
## color is continuous
ggplot(df, aes(x, y, color = color)) + ## legend is different than discrete
geom_line(aes(group = 1), size = 2) + ## group = 1 can be left out as it is the default anyway in this case
geom_point(size = 5)
Note that even when color is a continuous variable, ggplot2 does not smoothly blend from one aesthetic value to another. If this is the behavior you want, you can perform the linear interpolation yourself:
xgrid <- with(df, seq(min(x), max(x), length = 50))
interp <- data.frame(x = xgrid,
y = approx(df$x, df$y, xout = xgrid)$y,
color = approx(df$x, df$color, xout = xgrid)$y
)
ggplot(interp, aes(x, y, color = color)) +
geom_line(size = 2) +
geom_point(data = df, size = 5)
What about other collective geoms, such as polygons? Most collective geoms are more complicated than lines and paths, and a single geometric object can map onto many observations. A general rule, however, is that if the aesthetic values differ within a group, ggplot2 will use the default value instead of trying to apply the “conflicting” aesthetic values.
These issues are most relevant when mapping aesthetics to continuous variables. For example,
ggplot(mpg, aes(class, fill = hwy)) + ## hwy is a continuous variable
geom_bar() ## default grouping is class, so each bar is associated with multiple fills - doesn't work, resort to default
ggplot(mpg, aes(class, fill = hwy, group = hwy)) + ## overriding default grouping
geom_bar()
## what if using fill = factor(hwy) - discrete legend
ggplot(mpg, aes(class, fill = factor(hwy))) +
geom_bar()
Note that the bars are stacked in the order defined by the grouping variable hwy. If you need fine control, you’ll need to create a factor with levels ordered as needed.
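As a minimal sketch of that last point, the block below builds a factor by hand and uses it for fill. The column name hwy_f and the copy mpg2 are made up for illustration; the only assumptions are base R's factor() and the mpg data shipped with ggplot2.

```r
library(ggplot2)

# Copy mpg and add a factor whose levels run from the highest to the
# lowest hwy value. `hwy_f` is a hypothetical column name for this sketch.
mpg2 <- as.data.frame(mpg)
mpg2$hwy_f <- factor(
  mpg2$hwy,
  levels = sort(unique(mpg2$hwy), decreasing = TRUE)
)

# Grouping (and therefore stacking) now follows the level order of hwy_f.
p <- ggplot(mpg2, aes(class, fill = hwy_f)) +
  geom_bar()
# print(p) draws the plot
```

Reordering the levels (for example with rev(levels(mpg2$hwy_f))) and rebuilding the factor is enough to flip the stacking order.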
geom_
Each geom has a set of aesthetics that it understands, some of which are required. The aesthetics for each geom are listed in its documentation.

- geom_blank(): display nothing. Most useful for adjusting axes limits using data.
- geom_point(): points
- geom_path(): paths
- geom_ribbon(): ribbons, a path with vertical thickness. See also geom_area().
- geom_segment(): a line segment, specified by start and end position. See also geom_line().
- geom_rect(): rectangles. See also geom_tile() and geom_polygon().
- geom_polygon(): filled polygons
- geom_text(): text
- geom_bar(): distribution of a discrete variable
- geom_histogram(): bin and count a continuous variable, display with bars
- geom_density(): smoothed density estimate
- geom_dotplot(): stack individual points into a dot plot
- geom_freqpoly(): bin and count a continuous variable, display with lines
- geom_point(): scatterplot
- geom_quantile(): smoothed quantile regression
- geom_rug(): marginal rug plots
- geom_smooth(): smoothed line of best fit
- geom_text(): text labels
- geom_bin2d(): bin into rectangles and count
- geom_density2d(): smoothed 2d density estimate
- geom_hex(): bin into hexagons and count
- geom_count(): count number of points at distinct locations
- geom_jitter(): randomly jitter overlapping points
- geom_bar(stat = "identity"): bar chart of precomputed summaries
- geom_boxplot(): boxplots
- geom_violin(): show density of values in each group
- geom_area(): area plot
- geom_line(): line plot
- geom_step(): step plot
- geom_crossbar(): vertical bar with center
- geom_errorbar(): error bars
- geom_linerange(): vertical line
- geom_pointrange(): vertical line with center
- geom_map(): fast version of geom_polygon() for map data
- geom_contour(): contours
- geom_tile(): tile the plane with rectangles
- geom_raster(): fast version of geom_tile() for equal sized tiles

A statistical transformation, or stat, transforms data, typically by summarizing it in some manner. For example, a useful stat is the smoother, which calculates the smoothed mean of y, conditional on x.
stat_
Here is a list of stats used behind the scenes by the corresponding geoms. We rarely call these functions directly, but they are useful to know because their documentation often provides more detail about the corresponding statistical transformation.

- stat_count(): geom_bar()
- stat_bin(): geom_freqpoly(), geom_histogram()
- stat_bin2d(): geom_bin2d()
- stat_bindot(): geom_dotplot()
- stat_binhex(): geom_hex()
- stat_boxplot(): geom_boxplot()
- stat_contour(): geom_contour()
- stat_quantile(): geom_quantile()
- stat_smooth(): geom_smooth()
- stat_sum(): geom_count()

Other stats can’t be created with a geom_ function:

- stat_ecdf(): compute an empirical cumulative distribution plot
- stat_function(): compute y values from a function of x values
- stat_summary(): summarize y values at distinct x values
- stat_summary2d(), stat_summary_hex(): summarize binned values
- stat_qq(): perform calculations for a quantile-quantile plot
- stat_spoke(): convert angle and radius to position
- stat_unique(): remove duplicated rows

There are two ways to use these functions. You can either add a stat_ function and override the default geom, or add a geom_ function and override the default stat. For example, the two snippets below generate the same output.
ggplot(mpg, aes(x = trans, y = cty)) +
geom_point() +
stat_summary(geom = "point", fun = "mean", color = "red", size = 4)
ggplot(mpg, aes(trans, cty)) +
geom_point() +
geom_point(stat = "summary", fun = "mean", color = "red", size = 4)
The second form is better because it makes it more clear that you are displaying a summary, not the raw data.
Internally, a stat takes a data frame as input and returns a data frame as output, and so a stat can add new variables to the original data frame. Each stat lists the variables that it computes in its documentation. It is possible to map aesthetics to these new variables.
For example, stat_bin, the default stat used to make histograms, produces the following variables:
- count: the number of observations in each bin. count is the default used by the histogram geom for the height of bars.
- density: the density of observations in each bin (proportion of the total, divided by the bin width)
- x: the center of the bin
- ncount: count, scaled to a maximum of 1
- ndensity: density, scaled to a maximum of 1

To refer to a generated variable, like density, wrap the variable name in after_stat(). This prevents confusion in case the original data set includes a variable with the same name as the generated variable, and it makes the intent clear to any later reader of the code.
p <- ggplot(diamonds, aes(price)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 500)
This technique is particularly useful when you want to compare the distribution of multiple groups that have very different sizes. For example, it’s hard to compare the distribution of price within cut because some groups are quite small. It’s easier to compare if we standardize each group to take up the same area (i.e. 1).
p1 <- ggplot(diamonds, aes(price, color = cut)) +
geom_freqpoly(binwidth = 500)
p2 <- ggplot(diamonds, aes(price, color = cut)) +
geom_freqpoly(aes(y = after_stat(density)), binwidth = 500)
p1 + p2 & theme(legend.position = "none")
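One way to check which variables a stat computes is to look at the data ggplot2 builds for a layer. The sketch below applies layer_data(), a ggplot2 helper, to the density histogram of price; the object names are arbitrary and the column selection is only for display.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(price)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 500)

# layer_data() returns the layer's data frame after the stat has run,
# including the computed variables count, density, ncount and ndensity.
built <- layer_data(p, 1)
head(built[, c("x", "count", "density", "ncount", "ndensity")])

# Density times bin width, summed over all bins, should be 1
# (up to floating point error).
total_area <- sum(built$density * (built$xmax - built$xmin))
total_area
```

Because each group's densities integrate to 1, groups of very different sizes end up on a comparable vertical scale.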
Position adjustments tweak the position of elements in a layer. For example, for bar plots, we can stack or dodge (side-by-side) the bars.
Continuous data typically doesn’t overlap exactly, and when it does (because of high density), minor adjustments like jittering are often not sufficient to fix the problem. For this reason, position adjustments are generally most useful for discrete data.
position_
Here is a list of position adjustment functions.

- position_stack(): stack overlapping bars (or areas) on top of each other; the default for bar plots, e.g. geom_bar() and geom_col().
- position_fill(): stack bars, and standardize each bar to have a constant height (of 1).
- position_dodge(): place overlapping bars (or boxplots) side by side.
- position_identity(): don’t adjust position.
- position_nudge(): move items (e.g. points) on discrete scales by a small amount; nudging is built into geom_text() to help avoid overlapping labels.
- position_jitter(): add a little random noise to every position (most useful when at least one position is discrete).
- position_jitterdodge(): dodge points within groups, then add a little random noise.

The way you pass parameters to position adjustments differs from stats and geoms. Instead of including additional arguments in ..., you construct a position adjustment object, supplying additional arguments in the call:
p1 <- ggplot(mpg, aes(displ, hwy)) +
geom_point(position = "jitter")
p2 <- ggplot(mpg, aes(displ, hwy)) +
geom_point(position = position_jitter(width = 0.5, height = 0.5))
p1 + p2
This is rather verbose, so geom_jitter() provides a convenient shortcut:
ggplot(mpg, aes(displ, hwy)) +
geom_jitter(width = 0.5, height = 0.5)
Scales in ggplot2 control the mapping from data to aesthetics. They take data and turn it into something you can see, like position, color, size or shape. They also provide the tools that let you interpret the plot: the axes and legends.
An important principle in ggplot2 is that every aesthetic in your plot is associated with exactly one scale.
Formally each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale). The axis or legend is the inverse function: it allows you to convert visual properties back to data.
In ggplot2, legend and axes are known collectively as guides.
The scale functions intended for users all follow a common naming scheme, made up of three pieces separated by "_":
- scale
- the name of the aesthetic (e.g. color, shape, x)
- the name of the scale (e.g. continuous, discrete, brewer, distiller)

All scale functions belong to one of three fundamental types: continuous scales, discrete scales, and binned scales. Each fundamental type is handled by one of three scale constructor functions: continuous_scale(), discrete_scale(), and binned_scale(). Although you should never need to call these constructor functions, it’s helpful to refer to their documentation for details of how the scale functions work.
An important property of ggplot2 is the principle that every aesthetic in your plot is associated with exactly one scale. For instance, when you write:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class))
ggplot2 adds a default scale for each aesthetic used in the plot:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_color_discrete()
The choice of default scale depends on the aesthetic and the variable type. Specifying the default scales would be tedious so ggplot2 does it for you. But if you want to override the defaults, you’ll need to add the scale yourself.
You use + to add scales to a plot. If you supply two scales for the same aesthetic, the last scale takes precedence. In other words, when you + a scale, you’re not actually adding it to the plot, but overriding the existing scale.
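A small sketch of this replacement behavior: two x scales are added in turn, and the scale the plot ends up using is inspected with layer_scales(), a ggplot2 helper. Reading the limits field off the returned scale object is an assumption about ggplot2's scale objects rather than a documented interface, so treat it as illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 10)) +
  scale_x_continuous(limits = c(1, 7))  # replaces the x scale added above

# layer_scales() returns the x and y scales used by a layer.
x_limits <- layer_scales(p)$x$limits
x_limits
```

ggplot2 also prints a message when the second scale is added, noting that the existing scale for x will be replaced.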
If you’re making small tweaks to the scales, you might continue to use the default scales and supply a few extra arguments. If you want to make more radical changes, you will override the default scales with alternatives. For example,
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
scale_x_sqrt() +
scale_color_brewer()
Every plot has two position scales corresponding to the x and y aesthetics. Typically the user specifies the variables mapped to x and y explicitly, but sometimes an aesthetic is mapped to a computed variable, as happens with geom_histogram(), and does not need to be explicitly specified. For example, the following plot specifications are equivalent. Although the first example does not state the y-aesthetic mapping explicitly, it still exists and is associated with (in this case) a continuous position scale.
ggplot(mpg, aes(x = displ)) + geom_histogram()
ggplot(mpg, aes(x = displ, y = after_stat(count))) + geom_histogram()
The list of common position scales is provided in this section. On top of that, important concepts related to position scales are illustrated by examining some arguments. Most of these arguments are shared by continuous_scale(), discrete_scale(), and/or binned_scale().
scale_x

- scale_x_continuous()
- scale_x_log10()
- scale_x_sqrt()
- scale_x_reverse(): reverse the range of the x-axis, e.g. from (20, 40) to (40, 20)
- scale_x_date()
- scale_x_datetime()
- scale_x_time()
- scale_x_discrete()
- scale_x_binned()

oob argument
By default, ggplot2 converts data outside the scale limits to NA. This means that changing the limits of a scale is not precisely the same as visually zooming in to a region of the plot. If your goal is to zoom in on part of the plot, you should use the xlim and ylim arguments of coord_cartesian().
Although the default behavior is to convert out-of-bounds values to NA, you can override this by setting the oob argument of the scale to a function that is applied to all observations outside the scale limits (set by the limits argument):

- scales::censor() replaces out-of-bounds values with NA
- scales::squish() squishes out-of-bounds values into the range
- scales::squish_infinite() squishes infinite values into the range

expand argument
The expand argument adds some padding around the data so that they do not overlap the axes. The defaults are to expand the scale by 5% on each side for continuous scales, and by 0.6 units on each side for discrete scales.
You can eliminate the expansion with expand = c(0, 0) for continuous position scales. One scenario where it is usually preferable to remove the expansion is when using geom_raster():
p1 <- ggplot(faithfuld, aes(waiting, eruptions)) +
geom_raster(aes(fill = density)) +
theme(legend.position = "none")
p2 <- ggplot(faithfuld, aes(waiting, eruptions)) +
geom_raster(aes(fill = density)) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme(legend.position = "none")
p1 + p2
breaks argument
You can supply a vector of breaks, but ggplot2 also allows you to pass a function to breaks. This function should have one argument that specifies the limits of the scale (a numeric vector of length two), and it should return a numeric vector of breaks.
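To make the contract concrete, here is a minimal hand-written break function. The name even_breaks is hypothetical; it simply returns every even number inside the limits it is given.

```r
library(ggplot2)

# Hypothetical break function: takes the scale limits (length-two numeric
# vector) and returns the even numbers that fall inside them.
even_breaks <- function(limits) {
  seq(ceiling(limits[1] / 2) * 2, floor(limits[2] / 2) * 2, by = 2)
}

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(breaks = even_breaks)

# Called directly on limits of 1.6 to 7, it returns 2, 4 and 6.
even_breaks(c(1.6, 7))
```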
You can write your own break function, but in many cases there is no need, thanks to the scales package.
- scales::breaks_extended(n = 5, ...): returns a function that takes a vector as input and creates automatic breaks for numeric axes
- scales::breaks_log(): creates breaks appropriate for log axes
- scales::breaks_pretty(): creates “pretty” breaks for date/time
- scales::breaks_width(width, offset = 0): creates equally spaced breaks

breaks_extended() is the standard method used in ggplot2. You can alter the desired number of breaks by updating the n = argument.
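These helpers are function factories: each call returns a break function, which can then be applied to a pair of limits. A short sketch, with arbitrary limits chosen for illustration:

```r
library(scales)

limits <- c(0, 35)

# Equally spaced breaks, 10 units apart, covering the limits.
by_width <- breaks_width(10)(limits)
by_width

# Automatic breaks; n is a target, not a guarantee.
auto <- breaks_extended(n = 5)(limits)
auto
```

In a plot, the factory call itself is what gets passed to the scale, for example scale_x_continuous(breaks = scales::breaks_width(10)).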
minor_breaks argument
Minor breaks are particularly useful for log scales because they give a clear visual indicator that the scale is non-linear.
As with breaks, you can also supply a function to minor_breaks, such as scales::minor_breaks_n(n), scales::minor_breaks_width(width, offset).
labels argument
The labels argument to the scale function is used to set the labels for breaks.
You can supply a character vector giving labels (must be same length as breaks) manually. But ggplot2 also allows you to pass a labeling function. A function passed to labels should accept a numeric vector of breaks as input and return a character vector of labels (same length as the input).
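For instance, a labeling function can be as small as a call to paste0(). The function name label_litres and the unit suffix are made up for this sketch.

```r
library(ggplot2)

# Hypothetical labeling function: appends a unit to each break value and
# returns a character vector of the same length as its input.
label_litres <- function(x) paste0(x, " L")

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(breaks = c(2, 4, 6), labels = label_litres)

label_litres(c(2, 4, 6))
```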
The scales package provides a number of tools that will automatically construct label functions for you:
- scales::label_bytes(): formats numbers as kilobytes, megabytes etc.
- scales::label_comma(): formats numbers as decimals with commas added
- scales::label_dollar(): formats numbers as currency
- scales::label_ordinal(): formats numbers in rank order: 1st, 2nd, 3rd etc.
- scales::label_percent(): formats numbers as percentages
- scales::label_pvalue(): formats numbers as p-values: <.05, <.01, .34 etc.

trans argument
When working with continuous data, the default (trans = "identity") is to map linearly from the data space onto the aesthetic space. It is possible to override this default using transformations. The most common use of scale transformations is to adjust continuous position scales.
The transformation is carried out by a “transformer”, which describes the transformation, its inverse, and how to draw the labels. You can construct your own transformer using scales::trans_new(). Here is a list of existing common transformations supplied by the scales package:
| Name | Function \(f(x)\) | Inverse \(f^{-1}(y)\) |
|---|---|---|
| asn | \(2\sin^{-1}(\sqrt{x})\) | \(\sin^2(y/2)\) |
| exp | \(e^{x}\) | \(log(y)\) |
| identity | \(x\) | \(y\) |
| log | \(log(x)\) | \(e^y\) |
| log2 | \(log_2(x)\) | \(2^y\) |
| log10 | \(log_{10}(x)\) | \(10^y\) |
| logit | \(log(\frac{x}{1-x})\) | \(\frac{1}{1+e^{-y}}\) |
| pow10 | \(10^x\) | \(log_{10}(y)\) |
| probit | \(\Phi^{-1}(x)\) | \(\Phi(y)\) |
| reciprocal | \(x^{-1}\) | \(y^{-1}\) |
| reverse | \(-x\) | \(-y\) |
| sqrt | \(x^{1/2}\) | \(y^2\) |
To simplify matters, ggplot2 provides convenience functions for the most common transformations: scale_x_log10(), scale_x_sqrt(), scale_x_reverse().
Instead of transforming the scales, you can also manually transform the data. For example, instead of using scale_x_log10() to transform the scale, you could transform the data instead and plot log10(x). The appearance of the geom will be the same, but the tick labels will be different. Specifically, if you use a transformed scale, the axes will be labeled in the original data space; if you transform the data, the axes will be labeled in the transformed space.
## no transformation
p0 <- ggplot(mpg, aes(displ, hwy)) +
geom_point()
## transforming data
p1 <- ggplot(mpg, aes(log10(displ), hwy)) +
geom_point()
## transforming scale
p2 <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_log10()
p0 + p1 + p2
Regardless of which method you use, the transformation occurs before any statistical summaries. To transform after statistical computation, use coord_trans().
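As a sketch of the custom transformer mentioned earlier, the block below defines a signed cube root with scales::trans_new() and passes it to the trans argument. The name cube_root_trans is made up; newer releases of scales and ggplot2 may prefer other names for this constructor and argument, so check the documentation of your installed versions.

```r
library(ggplot2)

# Hypothetical transformer: a signed cube root, which (unlike sqrt) is
# also defined for negative values.
cube_root_trans <- scales::trans_new(
  name = "cube_root",
  transform = function(x) sign(x) * abs(x)^(1 / 3),
  inverse = function(x) x^3
)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(trans = cube_root_trans)

# Round trip: applying the inverse to the transformed values recovers them.
round_trip <- cube_root_trans$inverse(cube_root_trans$transform(c(-8, 1, 27)))
round_trip
```

The inverse matters because ggplot2 uses it to label the axis in the original data space, as described above for transformed scales.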
limits argument
All scales have limits that define the domain over which the scale is defined; by default, they are derived from the range of the data.
It is most natural to think about the limits of position scales: they map directly to the ranges of the axes. But limits also apply to scales that have legends, like color, size, and shape, and these limits are particularly important if you want colors to be consistent across multiple plots.
limits for continuous scales
For continuous scales, limits should be a numeric vector of length two. If you only want to set the upper or lower limit, you can set the other value to NA.
Manually setting scale limits is a common task when you need to ensure that scales in different plots are consistent. When you create a faceted plot, ggplot2 automatically does this for you.
However, it is sometimes necessary to maintain consistency across multiple separate plots. By default, each plot sets its scale limits independently when rendered, which is often undesirable, so the limits have to be set manually. For example,
mpg_99 <- subset(mpg, year == 1999)
mpg_08 <- subset(mpg, year == 2008)
p1 <- ggplot(mpg_99, aes(displ, hwy, color = fl)) +
geom_point()
p2 <- ggplot(mpg_08, aes(displ, hwy, color = fl)) +
geom_point()
p3 <- p1 +
scale_x_continuous(limits = c(1, 7)) +
scale_y_continuous(limits = c(10, 45)) +
scale_color_discrete(limits = c("c", "d", "e", "p", "r"))
p4 <- p2 +
scale_x_continuous(limits = c(1, 7)) +
scale_y_continuous(limits = c(10, 45)) +
scale_color_discrete(limits = c("c", "d", "e", "p", "r"))
(p1 / p3) | (p2 / p4)
lims(), xlim(), ylim()
Because modifying scale limits is such a common task, ggplot2 provides some convenience functions to make this easier. See Section 7.3.3 for details.
A special case arises when an aesthetic is mapped to a date/time type, such as

- Date class, for dates
- POSIXct class, for date-times
- hms class, for “time of day” values, provided by the hms package

If your dates are in a different format, you will need to convert them using as.Date(), as.POSIXct(), or hms::as_hms(). You may also find the lubridate package helpful to manipulate date/time data.
Date scales behave similarly to other continuous scales, but contain additional arguments that allow you to work in date-friendly units.
date_breaks argument
The date_breaks argument allows you to position breaks by date units (years, months, weeks, days, hours, minutes, and seconds). For example, date_breaks = "25 years" will place a major tick mark every 25 years. Note that if both breaks and date_breaks are specified, then date_breaks wins.
It may be useful to note that internally date_breaks = "25 years" is treated as a shortcut for breaks = scales::breaks_width(width = "25 years"). The longer form is typically unnecessary, but it can be useful if you wish to specify an offset (see Section 3.2.2.3).
century20 <- as.Date(c("1900-01-01", "1999-12-31"))
brks <- scales::breaks_width("25 years")
brks(century20)
## [1] "1900-01-01" "1925-01-01" "1950-01-01" "1975-01-01" "2000-01-01"
brks2 <- scales::breaks_width("25 years", offset = 31)
brks2(century20)
## [1] "1900-02-01" "1925-02-01" "1950-02-01" "1975-02-01" "2000-02-01"
date_minor_breaks argument
If both minor_breaks and date_minor_breaks are specified, date_minor_breaks wins.
date_labels argument
If both labels and date_labels are specified, date_labels wins.
date_labels controls the display of the labels using the same formatting strings as in strptime() and format(). For example, to display dates like 14/10/1979, you would use the string "%d/%m/%Y".
One useful scenario for date label formatting is when there’s insufficient room to specify a four-digit year. Using %y ensures that only the last two digits are displayed. You can also include the line break character \n in a formatting string, particularly when full-length month names are included.
base <- ggplot(economics, aes(date, psavert)) +
geom_line(na.rm = TRUE) +
labs(x = NULL, y = NULL)
p1 <- base + scale_x_date(date_breaks = "5 years", date_labels = "%y")
p2 <- base + scale_x_date(limits = c(as.Date(c("2004-01-01", "2005-01-01"))), date_labels = "%B\n%Y")
p1 + p2
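Because date_labels uses the same formatting strings as format(), a format string can be tried on a single date in base R before it goes into a scale. A quick sketch:

```r
d <- as.Date("1979-10-14")

format(d, "%d/%m/%Y")  # day/month/four-digit year: "14/10/1979"
format(d, "%y")        # two-digit year: "79"
format(d, "%m\n%Y")    # month number and year separated by a line break
```

Month names produced by %B depend on the current locale, so they are left out of this check.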
An alternative to using date_labels is to pass a labeling function to the labels argument (see Section 3.2.2.5). The scales package provides two convenient functions that will generate labeling functions for you:
- label_date(): is what date_labels does behind the scenes, so you rarely need to call it directly.
- label_date_short(): automatically constructs short labels that are sufficient to uniquely identify the dates.

Internally, ggplot2 handles discrete scales by mapping each category to an integer value, then drawing the geom at the corresponding coordinate location. To illustrate this, we can add a custom annotation to the plot:
ggplot(mpg, aes(x = hwy, y = class)) +
geom_point() +
annotate("text", x = 5, y = 1:7, label = 1:7)
limits argument
For discrete scales, limits should be a character vector that enumerates all possible values.
labels argument
When the data are categorical, you can use a named vector to set the labels associated with particular values. This allows you to change some labels and not others, without altering the ordering or the breaks.
There are also functions relevant to discrete data, such as scales::label_wrap(), which allows you to wrap long strings across lines.
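A short sketch of both ideas, using the drv variable of mpg (values "4", "f" and "r"). The replacement label text is made up for illustration.

```r
library(ggplot2)

# Relabel two of the three drv values; "r" keeps its default label.
p <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot() +
  scale_x_discrete(labels = c("4" = "4-wheel", "f" = "front"))

# label_wrap() returns a labeling function that inserts line breaks
# so that each line stays within roughly the given width.
wrap10 <- scales::label_wrap(10)
wrapped <- wrap10("a fairly long category label")
cat(wrapped)
```

The wrapping function can be passed to a scale in the same way, e.g. scale_x_discrete(labels = scales::label_wrap(10)).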
A variation on discrete position scales are binned scales, where a continuous variable is sliced into multiple bins and the discretized variable is then plotted.
ggplot(mpg, aes(hwy, class)) +
geom_count() + ## `geom_count()`: the size of dots scales with the number of observations
scale_x_binned(n.breaks = 10)
At the physical level, color is produced by a mixture of wavelengths of light. To characterize a color completely, we need to know the complete mixture of wavelengths. Fortunately for us the human eye only has three different color receptors, and so we can summarize the perception of any color with just three numbers.
You may be familiar with the RGB encoding of color space, which defines a color by the intensities of red, green and blue light needed to produce it. One problem with this space is that it is not perceptually uniform: two colors that are one unit apart may look similar or very different depending on where they are in the color space. This makes it difficult to create a mapping from a continuous variable to a set of colors.
There have been many attempts to come up with color spaces that are more perceptually uniform. A modern attempt is HCL color space, which has three components of Hue, chroma and luminance:
The three dimensions have different properties. Hues are arranged around a color wheel, and are not perceived as ordered, e.g. green does not seem “larger” than red. In contrast, both chroma and luminance are perceived as ordered: pink is perceived as lying between red and white.
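Base R exposes this color space through hcl() in grDevices, which makes it easy to vary one dimension while holding the others fixed. The particular hue, chroma and luminance values below are arbitrary choices for illustration.

```r
# Three hues, 120 degrees apart, at the same chroma and luminance:
# different colors with no natural ordering among them.
same_cl <- hcl(h = c(0, 120, 240), c = 50, l = 70)
same_cl

# One hue at three luminance levels: an ordered dark-to-light sequence.
same_h <- hcl(h = 0, c = 50, l = c(30, 60, 90))
same_h
```

Both calls return hexadecimal color strings that can be passed to any ggplot2 scale or geom that accepts colors.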
scale_color_
Note that any time I refer to scale_fill_*(), there is a corresponding scale_color_*().

- scale_fill_continuous(): default scale for continuous fill scales. It defaults to scale_fill_gradient() behind the scenes.
- scale_fill_viridis_c(): derived from the viridis scales, shipped with ggplot2
- scale_fill_distiller(): derived from the ColorBrewer scales, shipped with ggplot2
- scale_fill_scico(): from the scico package
- scale_fill_gradient(): a two-point gradient scale, the default for continuous color scales
- scale_fill_gradient2(): a three-point gradient scale
- scale_fill_gradientn(): an n-point gradient scale
- scale_fill_discrete(): defaults to scale_fill_hue()/scale_fill_brewer()
- scale_fill_viridis_d(): derived from the viridis scales, shipped with ggplot2
- scale_fill_brewer(): derived from the ColorBrewer scales, shipped with ggplot2
- scale_fill_scico_d(): from the scico package
- scale_fill_hue(): the default scale for discrete colors
- scale_fill_grey(): for black and white print
- scale_fill_binned(): default binned color scale, which defaults to scale_fill_steps()
- scale_fill_viridis_b(): derived from the viridis scales, shipped with ggplot2
- scale_fill_fermenter(): derived from the ColorBrewer scales, shipped with ggplot2
- scale_fill_steps(): a two-point gradient color scale, the default for binned color scales
- scale_fill_steps2(): a three-point gradient color scale
- scale_fill_stepsn(): an n-point gradient color scale

Color gradients are often used to show the height of a 2d surface. The plots in this section use the surface of a 2d density estimate of the faithful dataset, which records the waiting time between eruptions and the duration of each eruption for the Old Faithful geyser in Yellowstone Park.
erupt <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
geom_raster() +
scale_x_continuous(name = NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0)) +
theme(legend.position = "none")
There are multiple ways to specify continuous color schemes. You can create your own palette, but this section focuses on the many “hand picked” palettes available in ggplot2 and in R.
The viridis scales (viridis_c() for continuous scales, viridis_b() for binned scales, viridis_d() for discrete scales) are designed to be perceptually uniform, both in color and when reduced to black and white, and to be perceptible to people with various forms of color blindness.
erupt2 <- erupt + scale_fill_viridis_c()
erupt3 <- erupt + scale_fill_viridis_c(option = "magma")
erupt + erupt2 + erupt3
erupt2 <- erupt + scale_fill_distiller()
erupt3 <- erupt + scale_fill_distiller(palette = "RdPu")
erupt + erupt2 + erupt3
There are many other packages that provide useful color palettes. The scico package is one that provides more palettes that are perceptually uniform and suitable for scientific visualization.
erupt2 <- erupt + scico::scale_fill_scico(palette = "bilbao") ## the default
erupt3 <- erupt + scico::scale_fill_scico(palette = "vik")
erupt4 <- erupt + scico::scale_fill_scico(palette = "lajolla")
erupt2 + erupt3 + erupt4
As there are a great many palette packages in R, a particularly useful package is paletteer, which aims to provide a common interface. The example will look something like this:
erupt + paletteer::scale_fill_paletteer_c("viridis::plasma")
erupt + paletteer::scale_fill_paletteer_c("scico::tokyo")
erupt + paletteer::scale_fill_paletteer_c("gameofthrones::targaryen")
The default scale for continuous fill scales is scale_fill_continuous(), which in turn defaults to scale_fill_gradient(). Gradient scales provide a robust method for creating any color scheme you like. All you need to do is specify two or more reference colors, and ggplot2 will interpolate linearly between them.
There are three functions that you can use for this purpose:
- scale_fill_gradient(): produces a two-color gradient
- scale_fill_gradient2(): produces a three-color gradient with a specified midpoint
- scale_fill_gradientn(): produces an n-color gradient

The use of gradient scales is illustrated below.
erupt2 <- erupt + scale_fill_gradient(low = "grey", high = "brown")
erupt3 <- erupt + scale_fill_gradient2(low = "grey", mid = "white", high = "brown", midpoint = 0.02)
erupt4 <- erupt + scale_fill_gradientn(colors = terrain.colors(7))
erupt2 + erupt3 + erupt4
For a two-point gradient scale, you generally want to convey the perceptual impression that the values are sequentially ordered, so you want to keep hue constant and vary chroma and luminance. The Munsell color system is useful for this, as it provides an easy way of specifying colors based on their hue, chroma and luminance.
The munsell package provides easy access to the Munsell colors. The hue_slice() function plots slices of the Munsell color system where hue is constant. The Munsell colors can then be used to specify a gradient scale:
# generate a ggplot with hue_slice()
m1 <- munsell::hue_slice()
m2 <- munsell::hue_slice("5P")
erupt2 <- erupt + scale_fill_gradient(low = munsell::mnsl("5P 2/12"), high = munsell::mnsl("5P 7/12"))
m1 + m2 + erupt2
## Warning: Removed 31 rows containing missing values (geom_text).
Three-point gradient scales have slightly different criteria than two-point gradient scales. Typically the goal in such a scale is to convey the perceptual impression that there is a natural midpoint (often a zero value) from which the other values diverge.
If you have colors that are meaningful for your data (e.g., black body colors, or standard terrain colors), or if you’d like to use a palette produced by another package, you may wish to use an n-point gradient.
As an illustration, the plots below use the colorspace package.
erupt2 <- erupt + scale_fill_gradientn(colors = colorspace::rainbow_hcl(7))
erupt3 <- erupt + scale_fill_gradientn(colors = colorspace::heat_hcl(7))
erupt4 <- erupt + scale_fill_gradientn(colors = colorspace::diverge_hcl(7))
erupt2 + erupt3 + erupt4
All continuous color scales have an na.value parameter that controls what color is used for missing values (including values outside the range of the scale limits). By default, it is set to “grey50”, which will stand out when you use a colorful scale. If you use a black and white scale, you might want to set it to something else to make it more obvious.
You can set na.value = NA to make missing values invisible, or choose a specific color if you prefer:
df <- data.frame(x = 1, y = 1:5, z = c(1, 3, 2, NA, 5))
p1 <- ggplot(df, aes(x, y)) +
geom_tile(aes(fill = z), size = 4) +
labs(x = NULL, y = NULL)
p2 <- p1 + scale_fill_gradient(na.value = NA)
p3 <- p1 + scale_fill_gradient(na.value = "yellow")
p1 + p2 + p3
Discrete color/fill scales occur in many situations, such as bar charts and scatterplots.
The default scale for discrete colors is scale_fill_discrete(), which in turn defaults to scale_fill_hue(). So the following three plots are identical:
bars <- ggplot(df, aes(x, y, fill = factor(z))) + ## z is numeric, so convert it to a factor for a discrete scale
geom_bar(stat = "identity")
bars + scale_fill_discrete()
bars + scale_fill_hue()
The default scale has some limitations. The following sections describe some nicer discrete palettes.
The default color scheme scale_fill_hue() picks evenly spaced hues around the HCL color wheel. This works well for up to about eight colors, but after that it becomes hard to tell the different colors apart.
With scale_fill_hue(), you can control the default chroma and luminance, and the range of hues, with the h, c and l arguments.
df <- data.frame(y = c(2, 3, 1, 4), x = c("a", "b", "c", "d"))
bars <- ggplot(df, aes(x, y)) +
geom_bar(aes(fill = x), stat = "identity") +
theme(legend.position = "none") +
labs(x = NULL)
p2 <- bars + scale_fill_hue(c = 40)
p3 <- bars + scale_fill_hue(h = c(150, 300))
bars + p2 + p3
However, one disadvantage of the default color scheme is that because the colors all have the same luminance and chroma, they all appear as an identical shade of grey when printed in black and white. If you intend a discrete color scale to be printed in black and white, it is better to use scale_fill_grey(), which maps discrete data to greys from light to dark:
p1 <- bars + scale_fill_grey()
p2 <- bars + scale_fill_grey(start = 0.5, end = 1)
p3 <- bars + scale_fill_grey(start = 0, end = 0.5)
p1 + p2 + p3
scale_fill_brewer() is a discrete color scale that - along with the continuous analog scale_fill_distiller() and binned analog scale_fill_fermenter() - uses handpicked “ColorBrewer” colors taken from http://colorbrewer2.org/. These colors have been designed to work well in a wide variety of situations, although the focus is on maps and so the colors tend to work better when displayed in large areas.
To view the different palettes inside the RColorBrewer package:
RColorBrewer::display.brewer.all()
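There are 3 types of palettes:

- Sequential: for ordered data that runs from low to high.
- Diverging: for ordered data with a meaningful midpoint.
- Qualitative: for unordered categorical data.

For the continuous analog, see scale_color_distiller() (Section 3.3.3.1.2).

No palette is uniformly good for all purposes. Scatter plots typically use small plot markers, and bright colors tend to work better than subtle ones: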
df <- data.frame(x = 1:3 + runif(30), y = runif(30), z = c("a", "b", "c"))
point <- ggplot(df, aes(x, y)) +
geom_point(aes(color = z), size = 2) +
theme(legend.position = "none") +
labs(x = NULL, y = NULL)
p1 <- point + scale_color_brewer(palette = "Set1")
p2 <- point + scale_color_brewer(palette = "Set2")
p3 <- point + scale_color_brewer(palette = "Pastel1")
p1 + p2 + p3
Bar plots usually contain large patches of color, and bright colors can be overwhelming. Subtle colors tend to work better in this situation:
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))
bar <- ggplot(df, aes(x, y, fill = z)) +
geom_bar(stat = "identity") +
theme(legend.position = "none") +
labs(x = NULL, y = NULL)
p1 <- bar + scale_fill_brewer(palette = "Set1")
p2 <- bar + scale_fill_brewer(palette = "Set2")
p3 <- bar + scale_fill_brewer(palette = "Pastel1")
p1 + p2 + p3
If none of the hand-picked palettes is suitable, or if you have your own preferred colors, you can use scale_fill_manual() to set the colors manually.
This can be useful if you wish to choose colors that highlight a secondary grouping structure or draw attention to different comparisons:
df <- data.frame(y = c(2, 3, 1, 4), x = c("a", "b", "c", "d"))
bars <- ggplot(df, aes(x, y)) +
geom_bar(aes(fill = x), stat = "identity") +
theme(legend.position = "none") +
labs(x = NULL)
p1 <- bars + scale_fill_manual(values = c("sienna1", "sienna4", "hotpink1", "hotpink4"))
p2 <- bars + scale_fill_manual(values = c("tomato1", "tomato2", "tomato3", "tomato4"))
p3 <- bars + scale_fill_manual(values = c("grey", "black", "grey", "grey"))
p1 + p2 + p3
You can also use a named vector to assign colors to each discrete level, which allows you to specify the levels in any order you like:
bars + scale_fill_manual(values = c(
"d" = "grey",
"a" = "grey",
"c" = "black",
"b" = "grey"
))
Color scales can also come in binned versions. The default scale is scale_fill_binned(), which in turn defaults to scale_fill_steps(). As with the binned position scales, the binned color scales have an n.breaks argument that controls the number of discrete color categories created by the scale.
Counterintuitively - because the human visual system is very good at detecting edges - this can sometimes make a continuous color gradient easier to perceive:
erupt2 <- erupt + scale_fill_binned() ## default to `scale_fill_steps()`
erupt3 <- erupt + scale_fill_steps()
erupt4 <- erupt + scale_fill_binned(n.breaks = 8)
erupt2 + erupt3 + erupt4
scale_fill_fermenter() is a brewer analog for binned scales.
erupt2 <- erupt + scale_fill_fermenter(n.breaks = 9)
erupt3 <- erupt + scale_fill_fermenter(n.breaks = 9, palette = "Oranges")
erupt4 <- erupt + scale_fill_fermenter(n.breaks = 9, palette = "PuOr")
erupt2 + erupt3 + erupt4
Note that like the discrete scale_fill_brewer() - and unlike the continuous scale_fill_distiller() - the binned function scale_fill_fermenter() does not interpolate between the brewer colors, and if you set n.breaks larger than the number of colors in the palette a warning message will appear and some colors will not be displayed.
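One way to avoid that warning is to check how many colors a palette offers before choosing n.breaks. The sketch below assumes the RColorBrewer package is installed and looks up the palettes used in the example above.

```r
## Look up the maximum number of colors in the palettes used above
info <- RColorBrewer::brewer.pal.info
n_colors <- info[c("Oranges", "PuOr"), "maxcolors"]
names(n_colors) <- c("Oranges", "PuOr")
n_colors
## Oranges    PuOr
##       9      11
```

Sequential palettes such as "Oranges" top out at 9 colors and diverging palettes such as "PuOr" at 11, so requests for more bins than that cannot be filled without interpolation.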
For this topic, you can refer to Section 3.3.3.2.
In short, scale_fill_steps() is analogous to scale_fill_gradient(), and allows you to construct your own two-color gradients. There is also a three-color variant scale_fill_steps2() and n-color scale variant scale_fill_stepsn() that behave similarly to their continuous counterparts:
erupt2 <- erupt + scale_fill_steps(low = "grey", high = "brown")
erupt3 <- erupt + scale_fill_steps2(low = "grey", mid = "white", high = "brown", midpoint = 0.02)
erupt4 <- erupt + scale_fill_stepsn(n.breaks = 12, colors = terrain.colors(12))
erupt2 + erupt3 + erupt4
The size aesthetic is typically used to scale points and text.
The default scale for size aesthetics is scale_size() in which a linear increase in the variable is mapped onto a linear increase in the area (not the radius) of the geom. Scaling size as a function of area is a sensible default as human perception of size is more closely mimicked by area scaling than by radius scaling.
By default, the smallest value in the data (more precisely in the scale limits) is mapped to a size of 1 and the largest is mapped to a size of 6. The range argument allows you to scale the size of the geoms:
p1 <- ggplot(mpg, aes(displ, hwy, size = cyl)) +
geom_point()
p2 <- p1 + scale_size(range = c(1, 2))
p1 + p2
There are (rare) situations where area scaling is undesirable, and for such situations the scale_radius() function is provided.
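The difference between area and radius scaling can be seen directly in the palette functions of the scales package. This is a sketch under the assumption that scale_size() and scale_radius() are built on scales::area_pal() and scales::rescale_pal() respectively; the input values are made up for illustration.

```r
## Input values already rescaled to [0, 1]; chosen for illustration only
x <- c(0, 0.25, 1)

## Area scaling: size grows with the square root of the value
area_sizes <- scales::area_pal(range = c(1, 6))(x)
area_sizes
## 1.0 3.5 6.0

## Radius scaling: size grows linearly with the value
radius_sizes <- scales::rescale_pal(range = c(1, 6))(x)
radius_sizes
## 1.00 2.25 6.00
```

Both palettes agree at the ends of the range and differ only in how intermediate values are placed.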
Other size scales exist and are worth noting briefly:
- scale_size_binned(): a size scale that behaves like scale_size() but maps continuous values onto discrete size categories, analogous to the binned position and color scales.
- scale_size_area(), scale_size_binned_area(): versions of scale_size() and scale_size_binned() that ensure a value of 0 maps to an area of 0.
- scale_size_date(), scale_size_datetime(): designed to handle date data, analogous to the date scales.
- scale_size_manual(): has a values argument which allows you to specify user-defined sizes.

The shape scales are:

- scale_shape(): maps discrete variables to six easily discernible shapes. If you have more than six levels, you will get a warning, and the seventh and subsequent levels will not appear on the plot.
- scale_shape_binned(): you cannot map a continuous variable to shape unless scale_shape_binned() is used. But since shape has no inherent order, this use is not advised.
- scale_shape_manual(): has a values argument which allows you to specify user-defined shapes manually.

The only option is na.value.

- scale_linetype()
- scale_linetype_binned()
- scale_linetype_continuous()
- scale_linetype_discrete()

Alpha scales map the transparency of a shade to a value in the data. They are not often useful, but can be a convenient way to visually down-weight less important observations.
Here is a list of alpha scales:
- scale_alpha_continuous(), a.k.a. scale_alpha()
- scale_alpha_binned()
- scale_alpha_discrete()
- scale_alpha_ordinal()
- scale_alpha_manual()

scale_alpha() is an alias for scale_alpha_continuous() since that is the most common use of alpha, and it saves a bit of typing.
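As a minimal sketch of down-weighting observations with alpha, the example below fades out points with short waiting times in the faithful dataset. The choice of variable and the range values are illustrative assumptions, not taken from the text.

```r
library(ggplot2)

## Map waiting time to transparency; range sets the lowest and highest alpha used
p <- ggplot(faithful, aes(eruptions, waiting, alpha = waiting)) +
  geom_point() +
  scale_alpha(range = c(0.2, 1))
p
```

Narrowing the range (for example c(0.5, 1)) keeps the least important points more visible.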
Identity scales, such as scale_color_identity() and scale_shape_identity(), are used when your data is already scaled such that the data space and aesthetic space are the same. Section 7.2.2 has an example using scale_color_identity().
The code below is another example where the identity scale is useful. luv_colours contains the locations of all R’s built-in colors in the LUV color space (the space that HCL is based on). A legend is unnecessary, because the point color represents itself: the data and aesthetic spaces are the same.
str(luv_colours)
## 'data.frame': 657 obs. of 4 variables:
## $ L : num 9342 9101 8810 8935 8452 ...
## $ u : num -3.37e-12 -4.75e+02 1.01e+03 1.07e+03 1.01e+03 ...
## $ v : num 0 -635 1668 1675 1610 ...
## $ col: chr "white" "aliceblue" "antiquewhite" "antiquewhite1" ...
ggplot(luv_colours, aes(u, v)) +
geom_point(aes(color = col), size = 3) + ## col is discrete
scale_color_identity() +
coord_equal()
In ggplot2, legends and axes are known collectively as guides. You might find it surprising that axes and legends are the same type of thing, but while they look very different, they have the same purpose: to allow you to read observations from the plot and map them back to their original values.

| Argument name | Axis | Legend |
|---|---|---|
| name | Label | Title |
| breaks | Ticks & grid line | Key |
| labels | Tick label | Key label |

In ggplot2, guides are automatically produced based on the layers in your plot. You don’t directly control the legends and axes; instead you set up the data so that there’s a clear mapping between data and aesthetics, and a guide is generated for you.
This is very different from base R graphics, where you have total control over the legend, and it can be frustrating when you first start using ggplot2. However, once you get the hang of it, you’ll find it saves you time, and there is little you cannot do.
name and breaks arguments

The name argument to a scale function governs axis titles and legend titles. Specifically, the name argument to a position scale governs the axis titles, and the name argument to other scale functions (such as color/fill scales, size scales) governs the legend titles.
In the same way that the name argument to a scale function governs axis titles and legend titles, the breaks argument controls which values appear as tick marks on axes, and as keys on legends.
For axes:
toy <- data.frame(
const = 1,
up = 1:4,
txt = letters[1:4],
big = (1:4) * 1000,
log = c(2, 5, 10, 2000)
)
axs <- ggplot(toy, aes(big, const)) +
geom_point(size = 4)
p2 <- axs + scale_x_continuous(name = "User-defined axis title", breaks = c(2000, 2500, 4000))
axs + p2
For legends:
lgd <- ggplot(toy, aes(up, up, fill = big)) +
geom_tile() +
labs(x = NULL, y = NULL)
p2 <- lgd + scale_fill_continuous(name = "User-defined legend title", breaks = c(2000, 2500, 4000))
lgd + p2
You can suppress the breaks entirely by setting them to NULL. For axes, this removes the tick marks, grid lines, and tick labels; and for legends, this removes the keys and key labels, i.e. the entire legend.
base <- ggplot(toy, aes(up, up, fill = big)) +
geom_tile()
p1 <- base +
theme(legend.background = element_rect(color = "red"), plot.background = element_rect(color = "green"))
p2 <- p1 + scale_fill_continuous(breaks = NULL)
p1 + p2
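The labels argument from the table of scale arguments works alongside breaks. The sketch below rebuilds a cut-down version of the toy data so that it runs on its own; the label strings are made up for illustration.

```r
library(ggplot2)

## Cut-down copy of the toy data, so the example is self-contained
toy <- data.frame(const = 1, big = (1:4) * 1000)
axs <- ggplot(toy, aes(big, const)) +
  geom_point(size = 4)

## labels must have the same length as breaks; each label replaces one tick label
p <- axs + scale_x_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))
p
```

The same pair of arguments applies to legends, where labels replaces the key labels instead of the tick labels.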
guide argument and guide functions

Another way to modify the behavior of axes and legends is with the guide argument of the relevant scale function or - perhaps more conveniently - the guides() helper function. (See Section 7.3.2 for details.)
Scale guides work in a similar way to scale names, but are more complex than scale names: where the name argument (and labs()) takes text as input, the guide argument (and guides()) takes a guide object created by a guide function, such as guide_colorbar() and guide_legend(). The arguments to these functions offer additional fine control over the guide.
base <- ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point() ## default is `colorbar` for continuous color scales
p2 <- base + scale_color_continuous(guide = guide_colorsteps()) ## using `guide` argument
p2a <- base + scale_color_continuous(guide = "colorsteps") ## equivalent to p2
p3 <- base + guides(color = guide_colorsteps()) ## using `guides()` function
base + p3 + p2 + p2a + plot_layout(byrow = TRUE)
The table below summarizes the default guide functions associated with different scale types:
| Scale type | Default guide type |
|---|---|
| continuous scales for color/fill aesthetics | colorbar |
| binned scales for color/fill aesthetics | colorsteps |
| position scales (continuous, binned and discrete) | axis |
| discrete scales (except position scales) | legend |
| binned scales (except position/color/fill scales) | bins |
guide_legend()

The legend guide displays individual keys in a table. The most useful options are:

- nrow, ncol: specify the dimensions of the table.
- byrow: logical, whether the table is filled by row; FALSE by default.
- reverse: logical, reverses the order of the keys.
- override.aes: list, useful when you want the elements in the legend to display differently than the geoms in the plot. This is often required when you’ve used transparency or size to deal with moderate overplotting and also used color in the plot. See the example below.
- keywidth and keyheight (along with default.unit): allow you to specify the size of the keys. These are grid units, e.g. unit(1, "cm").

base <- ggplot(mpg, aes(displ, hwy, color = drv)) +
geom_point(size = 3, alpha = .2, stroke = 0)
p1 <- base + guides(color = guide_legend())
p2 <- base + guides(color = guide_legend(override.aes = list(alpha = 1)))
p1 + p2
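The layout options of guide_legend() listed above can be sketched in the same way. The variable mapped to color (class) and the number of rows are arbitrary choices for illustration.

```r
library(ggplot2)

## class has seven levels, which gives the legend enough keys to arrange
base <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point()

## Lay the keys out in 4 rows, fill the table row by row, and reverse the key order
p <- base + guides(color = guide_legend(nrow = 4, byrow = TRUE, reverse = TRUE))
p
```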
guide_bins()

guide_bins() is suited to the situation when a continuous variable is binned and then mapped to an aesthetic that produces a legend, such as size, shape, color and fill.
Unlike guide_legend(), the guide created for a binned scale by guide_bins() does not organize the individual keys into a table. Instead, they are arranged in a column (or row) along a single vertical (or horizontal) axis, which by default is displayed with its own axis.
The important arguments to guide_bins() are:
- axis: logical indicating whether the axis should be drawn.
- direction: character specifying the direction of the guide (“horizontal” or “vertical”).
- show.limits: logical, whether tick marks are shown at the ends of the guide axis.
- axis.colour, axis.linewidth, axis.arrow: control the guide axis display.
- keywidth, keyheight, reverse, override.aes: same behavior as in guide_legend() (Section 3.9.2.1).

base <- ggplot(mpg, aes(displ, manufacturer)) +
geom_point(aes(size = hwy), alpha = .3) +
scale_size_binned(n.breaks = 10)
p1 <- base + guides(size = guide_bins(axis = FALSE))
p2 <- base + guides(size = guide_bins(show.limits = TRUE))
p3 <- base + guides(size = guide_bins(axis.colour = "red",
axis.arrow = arrow(
length = unit(.1, "inches"),
ends = "first",
type = "closed"
)))
p1 + p2 + p3
guide_colorbar()

The color bar guide is designed for continuous ranges of colors. It outputs a rectangle over which the color gradient varies.

The important arguments are:

- barwidth, barheight: specify the size of the bar. These are grid units, e.g. unit(1, "cm").
- nbin: controls the number of slices. You may want to increase this from the default value of 20 if you draw a very long bar.
- reverse: flips the color bar to put the lowest values at the top.

base <- ggplot(mpg, aes(cyl, displ, color = hwy)) +
geom_point(size = 4)
p2 <- base + guides(color = guide_colorbar(reverse = TRUE))
p3 <- base + guides(color = guide_colorbar(barheight = unit(2, "cm")))
base + p2 + p3
guide_colorsteps()

The “color steps” guide is a version of guide_colorbar() appropriate for binned color and fill scales. It shows the areas between breaks as a single constant color, rather than displaying a color gradient that varies smoothly along the bar.

Arguments mostly mirror those for guide_colorbar(). The additional arguments are as follows:

- show.limits: logical, indicating whether values should be shown at the ends of the stepped color bar (analogous to the corresponding argument in guide_bins()).
- ticks: logical, indicating whether tick marks should be displayed adjacent to the key labels.
- even.steps: logical, indicating whether bins should be drawn at equal sizes or in proportion to the width of each bin in the data space.
base <- ggplot(mpg, aes(displ, hwy, color = cyl)) +
geom_point(size = 2) +
scale_color_binned()
p1 <- base + guides(color = guide_colorsteps(show.limits = TRUE))
p2 <- base + guides(color = guide_colorsteps(ticks = TRUE))
p1 + p2
This section focuses on legends because they are more complicated than axes.
By default, a layer will only appear in the legend if the corresponding aesthetic is mapped to a variable with aes().
As a general principle, a number of settings that affect the overall display of legends are controlled through the theme system, which can be modified with the theme() function.
The position of legends is controlled by the theme setting legend.position, which takes values “right”, “left”, “top”, “bottom”, or “none”. Switching between left/right and top/bottom modifies how the keys in each legend are laid out (horizontally or vertically), and how multiple legends are stacked (horizontally or vertically). If needed, you can adjust those options independently:

- legend.direction: layout of items in a legend (“horizontal” or “vertical”)
- legend.box: arrangement of multiple legends (“horizontal” or “vertical”)
- legend.box.just: justification of each legend within the overall bounding box, when there are multiple legends (“top”, “bottom”, “left”, “right”)

Alternatively, if there’s a lot of blank space in your plot you might want to place the legend inside the plot. You can do this by setting legend.position to a numeric vector of length two. The numbers represent a relative location in the panel area: c(0, 1) is the top-left corner and c(1, 0) is the bottom-right corner. You control which corner of the legend (the anchor point) legend.position refers to with legend.justification, which is specified in a similar way.
base <- ggplot(toy, aes(up, up)) +
geom_point(aes(color = txt), size = 3)
## place top-left of the legend to top-left of the panel
p1 <- base + theme(legend.position = c(0, 1), legend.justification = c(0, 1))
## place the center of the legend to the center of the panel
p2 <- base + theme(legend.position = c(0.5, 0.5), legend.justification = c(0.5, 0.5))
base + p1 + p2
Additionally, there is a margin around the legends, which you can suppress with legend.margin = margin(0, 0, 0, 0).
In most cases the default glyphs shown in the legend key will be appropriate to the layer and the aesthetic. Should you need to override this behavior, the key_glyph argument can be used to associate a particular layer with a different kind of glyph. key_glyph is an argument to layer(), and can be specified in the geom_ function.
Internally, each geom is associated with a key drawing function such as draw_key_path(), draw_key_boxplot(), which is responsible for drawing the key when the legend is created. You can pass to key_glyph the key drawing function or a text string with the value being the key drawing function name minus the draw_key_ prefix (no parentheses in order to use default data).
The key drawing functions all take the arguments (data, params, size):

- draw_key_point()
- draw_key_abline()
- draw_key_rect()
- draw_key_polygon()
- draw_key_blank()
- draw_key_boxplot()
- draw_key_crossbar()
- draw_key_path()
- draw_key_vpath()
- draw_key_dotplot()
- draw_key_pointrange()
- draw_key_smooth()
- draw_key_text()
- draw_key_label()
- draw_key_vline()
- draw_key_timeseries()

base <- ggplot(economics, aes(date, psavert, color = "savings"))
p1 <- base + geom_line()
p2 <- base + geom_line(key_glyph = "timeseries")
p3 <- base + geom_line(key_glyph = draw_key_timeseries)
p1 + p2 + p3
Merging legends occurs quite frequently when using ggplot2.
By default, a layer will only appear in the legend if the corresponding aesthetic is mapped to a variable with aes(). You can override whether or not a layer appears in the legend with show.legend: FALSE to prevent a layer from ever appearing in the legend; TRUE to force it to appear when it otherwise wouldn’t. (show.legend is an argument to layer().)
Using show.legend = TRUE can be useful in conjunction with the following trick to make points stand out:
p1 <- ggplot(toy, aes(up, up)) +
geom_point(size = 4, color = "grey20") +
geom_point(aes(color = txt), size = 2)
p2 <- ggplot(toy, aes(up, up)) +
geom_point(size = 4, color = "grey20", show.legend = TRUE) +
geom_point(aes(color = txt), size = 2)
p1 + p2
ggplot2 tries to use the fewest number of legends to accurately convey the aesthetics used in the plot. It does this by combining legends where the same variable is mapped to different aesthetics.
base <- ggplot(toy, aes(const, up)) +
scale_x_continuous(NULL, breaks = NULL)
p1 <- base + geom_point(aes(color = txt))
p2 <- base + geom_point(aes(shape = txt))
p3 <- base + geom_point(aes(color = txt, shape = txt))
p1 + p2 + p3
In order for legends to be merged, they must have the same name. So if you change the name of one of the scales, you’ll need to change it for all of them. One convenient way to do this is by using the labs() helper function.
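A minimal sketch of the naming rule, using a cut-down copy of the toy data so that it runs on its own. The title "Group" is an arbitrary name chosen for illustration.

```r
library(ggplot2)

## Cut-down copy of the toy data, so the example is self-contained
toy <- data.frame(const = 1, up = 1:4, txt = letters[1:4])
base <- ggplot(toy, aes(const, up)) +
  geom_point(aes(color = txt, shape = txt))

## Renaming only one scale gives the two scales different titles, so they no longer merge
p_split <- base + labs(color = "Group")

## Renaming both scales to the same title keeps a single merged legend
p_merged <- base + labs(color = "Group", shape = "Group")
```

Printing p_split shows two legends (one titled "Group", one titled "txt"), while p_merged shows one.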
Splitting a legend is a much less common data visualization task. By default, ggplot2 does not allow you to “split” the color aesthetic into multiple scales with separate legends. Nevertheless, there are exceptions to this general rule, and it is possible to override this behavior with the ggnewscale package.
Coordinate systems have two main jobs:

- Combine the two position aesthetics to produce a 2d position on the plot. The position aesthetics are called x and y, but they might be better called position 1 and position 2, because their meaning depends on the coordinate system used. For example, with the polar coordinate system, they become angle and radius (or radius and angle), and with maps they become latitude and longitude.
- Draw the axes and panel backgrounds, whose appearance depends on the coordinate system.

There are two types of coordinate systems.

Linear coordinate systems:

- coord_cartesian(): the default Cartesian coordinate system
- coord_flip(): Cartesian coordinate system with the x and y axes flipped
- coord_fixed(): Cartesian coordinate system with a fixed aspect ratio

Non-linear coordinate systems:

- coord_map()/coord_quickmap()/coord_sf(): map projections
- coord_polar(): polar coordinates
- coord_trans(): applies arbitrary transformations to x and y positions, after the data has been processed by the stat

coord_cartesian()

coord_cartesian() has arguments xlim and ylim. If you think back to the scales, you’ll recall that the scales also have a limits argument. The key difference is how the limits work: when setting scale limits, any data outside the limits is thrown away; but when setting coordinate system limits, we still use all the data and only display a small region of the plot, i.e. we zoom into the plot.
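The difference between the two kinds of limits can be sketched as follows. The dataset, the limits and the linear smooth are illustrative choices, not taken from the text.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

## Scale limits: values outside 4-6 become missing, so the smooth is fitted to the remaining data only
p_scale <- base + scale_x_continuous(limits = c(4, 6))

## Coordinate limits: all data is kept and the smooth is fitted to everything; only the view is zoomed
p_coord <- base + coord_cartesian(xlim = c(4, 6))
```

Printing p_scale produces warnings about removed rows, which is the visible sign that data has been discarded; p_coord produces none.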
coord_flip()

Most stats and geoms assume you are interested in y values conditional on x values: in most statistical models, the x values are assumed to be measured without error. If you are interested in x conditional on y (or you just want to rotate the plot 90 degrees), you can use coord_flip() to exchange the x and y axes. Compare this with just exchanging the variables mapped to x and y:
p1 <- ggplot(mpg, aes(displ, cty)) +
geom_point() +
geom_smooth()
p2 <- ggplot(mpg, aes(cty, displ)) +
geom_point() +
geom_smooth()
p3 <- p1 + coord_flip()
p1 + p2 + p3
coord_fixed()

coord_fixed() fixes the ratio of length on the x and y axes. The default ratio ensures that the x and y axes have equal scales, i.e. one unit on the x axis represents the same range of data as one unit on the y axis. The aspect ratio will also be set to ensure that the mapping is maintained regardless of the shape of the output device.
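A minimal sketch of coord_fixed(), using city and highway mileage from mpg because both are measured in the same units. The ratio of 2 in the second plot is an arbitrary value for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(cty, hwy)) +
  geom_point()

## Default ratio of 1: one unit on x is drawn as long as one unit on y
p_equal <- base + coord_fixed()

## ratio is expressed as y / x, so 2 draws one unit on y twice as long as one unit on x
p_tall <- base + coord_fixed(ratio = 2)
```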
Unlike linear coordinate systems, non-linear coordinate systems can change the shape of geoms.
coord_trans()

Like limits, transformations can be applied in two places: at the scale level or at the coordinate system level. coord_trans() has arguments x and y, which should be strings naming the transformer or transformer objects (see Section 3.2.2.6).
As with limits, transforming at the scale level occurs before statistics are computed, so it does not change the shape of the geom. Transforming at the coordinate system level occurs after the statistics have been computed, so it does affect the shape of the geom. Using both together allows us to model the data on a transformed scale and then backtransform it for interpretation: a common pattern in analysis.
## Linear model on original scale is poor fit
base <- ggplot(diamonds, aes(carat, price)) +
stat_bin2d() + ## `geom = "tile"` by default
geom_smooth(method = "lm") +
theme(legend.position = "none")
## Better fit on log scale, but harder to interpret
p2 <- base + scale_x_log10() + scale_y_log10()
## Fit on log scale, then backtransform to original
pow10 <- scales::exp_trans(10)
p3 <- p2 + coord_trans(x = pow10, y = pow10)
base + p2 + p3
coord_polar()

Using polar coordinate systems gives rise to pie charts and wind roses (from bar geoms), and radar charts (from line geoms). Polar coordinates are often used for circular data, particularly time or direction, but the perceptual properties are not good because the angle is harder to perceive for small radii than it is for large radii. The theta argument determines which position variable is mapped to angle (by default, x) and which to radius.
## stacked bar chart
base <- ggplot(mtcars, aes(factor(1), fill = factor(cyl))) +
geom_bar(width = 1) +
theme(legend.position = "none") +
scale_x_discrete(NULL, expand = c(0, 0)) +
scale_y_continuous(NULL, expand = c(0, 0))
## pie chart
p2 <- base + coord_polar(theta = "y")
## bullseye chart
p3 <- base + coord_polar()
base + p2 + p3
coord_map()

Maps are intrinsically displays of spherical data. Simply plotting raw longitudes and latitudes is misleading, so we must project the data. There are two ways to do this in ggplot2:

- coord_quickmap(): a quick and dirty approximation that sets the aspect ratio to ensure that 1m of latitude and 1m of longitude are the same distance in the middle of the plot. This is a reasonable place to start for smaller regions, and is very fast.
- coord_map(): uses the mapproj package to do a formal map projection. It takes the same arguments as mapproj::mapproject() for controlling the projection. It is much slower than coord_quickmap() because it must munch the data and transform each piece.

There are three types of faceting:

- facet_null(): a single plot, the default
- facet_wrap(): “wraps” a 1d ribbon of panels into 2d
- facet_grid(): produces a 2d grid of panels defined by variables which form the rows and columns. Compared to “wrap”, “grid” is fundamentally 2d.

This section uses a subset of the mpg dataset:
mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater")
facet_wrap()

facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into 2d (i.e. a grid). This is useful if you have a single variable with many levels and want to arrange the plots in a more space efficient manner.
You can control how the ribbon is wrapped into a grid with ncol, nrow, dir. dir controls the direction of wrap: “h” or “v”. The as.table argument controls whether the facets are laid out like a table (TRUE), with the highest values at the bottom-right, or a plot (FALSE), with the highest values at the top-right.
base <- ggplot(mpg2, aes(displ, hwy)) +
geom_blank() +
labs(x = NULL, y = NULL)
p1 <- base + facet_wrap(~class, nrow = 3)
p2 <- base + facet_wrap(~class, nrow = 3, as.table = FALSE)
p1 + p2
facet_grid()

facet_grid() lays out plots in a 2d grid, as defined by a formula:
- . ~ a: spreads the values of a across the columns.
- b ~ .: spreads the values of b down the rows.
- b ~ a: spreads a across columns and b down rows (the left-hand side of the formula gives the rows, the right-hand side the columns).

You can use multiple variables in the rows or columns, by “adding” them together, e.g. a + b ~ c + d. Variables appearing together on the rows or columns are nested in the sense that only combinations that appear in the data will appear in the plot. Variables that are specified on rows and columns will be crossed: all combinations will be shown, including those that didn’t appear in the original data set. This may result in empty panels.
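As a minimal sketch of the formula interface (assuming ggplot2 is attached; the mpg2 subset is recreated here only so the snippet runs on its own), the following puts drv on the rows and cyl on the columns. Because rows and columns are crossed, every drv/cyl combination gets a panel:

```r
library(ggplot2)

# Recreate the subset used in this section so the example is self-contained
mpg2 <- subset(mpg, cyl != 5 & drv %in% c("4", "f") & class != "2seater")

# rows ~ columns: drv is spread down the rows, cyl across the columns
p_grid <- ggplot(mpg2, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

# Inspect the panel layout computed by ggplot2: one row per panel
panel_layout <- ggplot_build(p_grid)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL")]
```

The ROW and COL columns show which row and column of the grid each panel occupies.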
For both facet_wrap() and facet_grid(), you can control whether the position scales are the same in all panels (fixed) or allowed to vary between panels (free), with the scales argument:
- scales = "fixed": x and y scales are fixed across all panels
- scales = "free_x": the x scale is free, and the y scale is fixed
- scales = "free_y": the y scale is free, and the x scale is fixed
- scales = "free": x and y scales vary across panels

facet_grid() imposes an additional constraint on the scales: all panels in a column must have the same x scale, and all panels in a row must have the same y scale. This is because each column shares an x axis, and each row shares a y axis.
Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.
p <- ggplot(mpg2, aes(cty, hwy)) +
geom_abline() +
geom_jitter()
p1 <- p + facet_wrap(~cyl)
p2 <- p + facet_wrap(~cyl, scales = "free")
p1 / p2
Free scales are also useful when we want to display multiple time series that were measured on different scales (different y scales). To do this, we first need to change from ‘wide’ to ‘long’ data, stacking the separate variables into a single column.
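A minimal sketch of that reshaping step, assuming the tidyr package is installed and using the economics data set that ships with ggplot2 (the column names variable and value are chosen here purely for illustration):

```r
library(ggplot2)
library(tidyr)

# Stack every measured series (all columns except date) into one column
econ_long <- pivot_longer(
  economics,
  cols = -date,
  names_to = "variable",
  values_to = "value"
)

# One panel per series; each panel gets its own y scale
p_series <- ggplot(econ_long, aes(date, value)) +
  geom_line() +
  facet_wrap(~variable, scales = "free_y", ncol = 1)
p_series
```

With scales = "free_y" the x axis (date) stays shared, so the series remain aligned in time while each keeps a readable y range.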
facet_grid() has an additional argument called space, which takes the same values as scales. When space is “free”, each column (or row) will have width (or height) proportional to the range of the scale for that column (or row). This is most useful for categorical scales, where we can assign space proportionally based on the number of levels in each facet, as illustrated below:
ggplot(mpg2, aes(cty, model)) +
geom_point() +
facet_grid(manufacturer ~ ., scales = "free", space = "free") +
theme(strip.text.y = element_text(angle = 0))
If you use faceting on a plot with multiple datasets (on different layers), what happens when one of those datasets does not contain the faceting variables? Missing faceting variables are treated as if they had all the values, so the data from that layer is displayed in every facet.
This situation commonly arises when you are adding contextual information that should be the same in all panels. The technique is particularly useful when you add annotations to make it easier to compare between facets. See the example in Section 5.5.
Faceting is an alternative to using aesthetics (like color, size, shape) to differentiate groups.
Both techniques have strengths and weaknesses, based around the relative positions of the subsets. With faceting, each group is quite far apart in its own panel, and there is no overlap between groups. This is good if the groups overlap a lot, but it does make small differences harder to see. When using aesthetics, the groups are close together and may overlap, but small differences are easier to see.
Comparisons between facets often benefit from some thoughtful annotations. For example, we could show the mean of each group in every panel. To do this, we group and summarize the data using the dplyr package. Note that we need two “z” variables: one for the facets and one for the colors.
df <- data.frame(x = rnorm(90), y = runif(90), z = letters[1:3])
df_sum <- df %>%
group_by(z) %>%
summarise(x = mean(x), y = mean(y)) %>%
rename(z2 = z)
ggplot(df, aes(x, y)) +
geom_point() +
geom_point(data = df_sum, aes(color = z2), size = 4) + ## z does not exist in df_sum
facet_wrap(~z)
Another useful technique is to put all the data in the background of each panel:
df2 <- dplyr::select(df, -z)
ggplot(df, aes(x, y)) +
geom_point(data = df2, color = "grey70", size = 4) +
geom_point(aes(color = z), size = 4) +
facet_wrap(~z)
To facet continuous variables, you must first discretise them. ggplot2 provides three helper functions to do so:
- n bins, each of the same length: cut_interval(x, n)
- bins of a given width: cut_width(x, width)
- n bins, each containing (approximately) the same number of points: cut_number(x, n = 10)

They are illustrated below:
## bins of width 1
mpg2$displ_w <- cut_width(mpg2$displ, 1)
## six bins containing equal number of points
mpg2$displ_n <- cut_number(mpg2$displ, 6)
## six bins of equal length
mpg2$displ_i <- cut_interval(mpg2$displ, 6)
ggplot(mpg2, aes(cty, hwy)) +
geom_point() +
facet_wrap(~displ_i, nrow = 1) +
labs(x = NULL, y = NULL)
The ggplot2 theme system allows you to exercise fine control over the non-data elements of your plot. Themes give you control over things such as fonts, ticks, panel strips, and backgrounds.
The theming system is composed of four main components:
- Theme elements specify the non-data elements that you can control. For example, the plot.title element controls the appearance of the plot title; axis.ticks.x, the ticks on the x axis; legend.key.height, the height of the keys in the legend.
- Each element is associated with an element function, which describes the visual properties of the element. For example, element_text() sets the font size, color, and face of text elements like plot.title.
- The theme() function allows you to override the default theme elements by calling element functions, like theme(plot.title = element_text(color = "red")).
- Complete themes, like theme_grey(), set all of the theme elements to values designed to work together harmoniously.

ggplot2 comes with a number of built in themes. The most important is theme_grey(), the signature ggplot2 theme with a light grey background and white grid lines.
There are 8 other themes built in to ggplot2:
- theme_bw(): a variation on theme_grey() that uses a white background and thin grey grid lines.
- theme_linedraw(): a theme with only black lines of various widths on white backgrounds, reminiscent of a line drawing.
- theme_light(): similar to theme_linedraw() but with light grey lines and axes, to direct more attention towards the data.
- theme_dark(): the dark cousin of theme_light(), with similar line sizes but a dark background.
- theme_minimal(): a minimalistic theme with no background annotations.
- theme_classic(): a classic-looking theme, with x and y axis lines and no grid lines.
- theme_void(): a completely empty theme.
- theme_test(): a theme for visual unit tests. It should ideally never change except for new features.

base <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
labs(x = NULL, y = NULL)
p1 <- base + theme_grey() + ggtitle("theme_grey()")
p2 <- base + theme_bw() + ggtitle("theme_bw()")
p3 <- base + theme_linedraw() + ggtitle("theme_linedraw()")
p4 <- base + theme_light() + ggtitle("theme_light()")
p5 <- base + theme_dark() + ggtitle("theme_dark()")
p6 <- base + theme_minimal() + ggtitle("theme_minimal()")
p7 <- base + theme_classic() + ggtitle("theme_classic()")
p8 <- base + theme_void() + ggtitle("theme_void()")
p9 <- base + theme_test() + ggtitle("theme_test()")
p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8 + p9 + plot_layout(ncol = 3)
All themes have a base_size parameter which controls the base font size. The base font size is the size that the axis titles use. The plot title is usually bigger (1.2x), and the tick and strip labels are smaller (0.8x).
As well as applying themes one plot at a time, you can change the default theme with theme_set() for all plots, e.g. theme_set(theme_bw()).
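The two mechanisms described here, base_size and theme_set(), can be sketched as follows. This is only an illustration on the mpg data; the object names p, p_big and old_theme are made up for the example:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Apply a complete theme with a larger base font size to a single plot
p_big <- p + theme_bw(base_size = 16)

# Change the default theme for all later plots; theme_set() returns the
# previous default, so it can be restored afterwards
old_theme <- theme_set(theme_bw())
p                      # drawn with theme_bw() while it is the default
theme_set(old_theme)   # restore the previous default
```

Keeping the return value of theme_set() is a simple way to make the change temporary.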
You are not limited to the themes built in to ggplot2. Other packages, like ggthemes, add even more.
The complete themes are a great place to start but don’t give you a lot of control. To modify individual elements, you need to use theme() to override the default setting for an element with an element function.
To modify an individual theme component, you use code like plot + theme(element.name = element_function()).
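A minimal sketch of this pattern, assuming the mpg data and an arbitrary title chosen for illustration:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  ggtitle("Engine size and highway mileage")

# Override two elements of the default theme:
# - plot.title is a text element, so it is set with element_text()
# - panel.grid.minor is removed entirely with element_blank()
p_custom <- p + theme(
  plot.title = element_text(color = "red", face = "bold", size = 14),
  panel.grid.minor = element_blank()
)
p_custom
```

Elements that are not mentioned in theme() keep the values of the current default theme.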
There are 4 basic types of built-in element functions: text, lines, rectangles, and blank:
- element_text(): draws labels and headings. You can control the font family, face, color, size (in points), hjust, vjust, angle (in degrees), and lineheight (as a multiple of the font size). More details on the parameters can be found in vignette("ggplot2-specs"). You can control the margins around the text with the margin argument and margin() function. margin() has four arguments: the amount of space (in points) to add to the top, right, bottom and left sides of the text. Any elements not specified default to 0.
- element_line(): draws lines, parameterised by color, size and linetype.
- element_rect(): draws rectangles, mostly used for backgrounds, parameterised by fill color and border color, size, and linetype.
- element_blank(): draws nothing. Use this if you don’t want anything drawn, and no space allocated for that element. If you do want the space to be kept (perhaps because the plot needs to line up with other plots on the page), use color = NA, fill = NA to create invisible elements that still take up space.

A few other settings take grid units rather than element functions; create them with unit(1, "cm") or unit(0.25, "inch").

To modify theme elements for all future plots, use theme_update(). It returns the previous theme settings, so you can easily restore the original parameters once you’re done.
old_theme <- theme_update(
plot.background = element_rect(fill = "lightblue3", color = NA),
panel.background = element_rect(fill = "lightblue", color = NA),
axis.text = element_text(color = "linen"),
axis.title = element_text(color = "linen")
)
base
theme_set(old_theme)
base
There are around 40 unique elements that control the appearance of the plot. They can be roughly grouped into five categories: plot, axis, legend, panel, and facet.
Some elements that affect the plot as a whole:
| Element | Setter | Description |
|---|---|---|
| plot.background | element_rect() | plot background (border etc.) |
| plot.title | element_text() | plot title |
| plot.margin | margin() | margins around plot |
The axis elements control the appearance of the axes:
| Element | Setter | Description |
|---|---|---|
| axis.line | element_line() | line parallel to axis |
| axis.text | element_text() | tick labels |
| axis.text.x | element_text() | x-axis tick labels |
| axis.text.y | element_text() | y-axis tick labels |
| axis.title | element_text() | axis titles |
| axis.title.x | element_text() | x-axis title |
| axis.title.y | element_text() | y-axis title |
| axis.ticks | element_line() | axis tick marks |
| axis.ticks.length | unit() | length of tick marks |
Note that axis.text (and axis.title) comes in three forms: axis.text, axis.text.x, and axis.text.y. Any properties that you don’t explicitly set in axis.text.x and axis.text.y will be inherited from axis.text.
The most common adjustment is to rotate the x-axis tick labels to avoid long overlapping labels. If you do this, note that negative angles tend to look best and you should set hjust = 0 and vjust = 1.
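A short sketch of that rotation, using the manufacturer names in mpg as an example of long x-axis labels (the angle of -30 degrees is an arbitrary choice):

```r
library(ggplot2)

# Manufacturer names are long enough to crowd a horizontal axis
p_rot <- ggplot(mpg, aes(manufacturer, hwy)) +
  geom_boxplot() +
  # negative angle, with labels anchored at their top-left corner
  theme(axis.text.x = element_text(angle = -30, hjust = 0, vjust = 1))
p_rot
```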
The legend elements control the appearance of all legends. You can also modify the appearance of individual legends by modifying the same elements in guide_legend() or guide_colorbar().
| Element | Setter | Description |
|---|---|---|
| legend.background | element_rect() | legend background |
| legend.key | element_rect() | background of legend keys |
| legend.key.size | unit() | legend key size |
| legend.key.height | unit() | legend key height |
| legend.key.width | unit() | legend key width |
| legend.margin | unit() | legend margin |
| legend.text | element_text() | legend labels |
| legend.text.align | 0-1 | legend label align (0 = left, 1 = right) |
| legend.title | element_text() | legend title |
| legend.title.align | 0-1 | legend title align (0 = left, 1 = right) |
| legend.position | “none”, “left”, “right”, “top”, “bottom” | legend position |
| legend.direction | “horizontal”, “vertical” | legend direction |
| legend.justification | two-element numeric vector | anchor point of legend |
| legend.box | “horizontal”, “vertical” | arrangement of multiple legends |
Panel elements control the appearance of the plotting panels:
| Element | Setter | Description |
|---|---|---|
| panel.background | element_rect() | panel background (under data) |
| panel.border | element_rect() | panel border (over data) |
| panel.grid.major | element_line() | major grid lines |
| panel.grid.major.x | element_line() | vertical major grid lines |
| panel.grid.major.y | element_line() | horizontal major grid lines |
| panel.grid.minor | element_line() | minor grid lines |
| panel.grid.minor.x | element_line() | vertical minor grid lines |
| panel.grid.minor.y | element_line() | horizontal minor grid lines |
| aspect.ratio | numeric | plot aspect ratio |
The main difference between panel.background and panel.border is that the background is drawn underneath the data, and the border is drawn on top of it. For this reason, you’ll always need to assign fill = NA when overriding panel.border.
Note that aspect.ratio controls the aspect ratio of the panel, not the overall plot.
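The points about panel.border and aspect.ratio can be sketched as follows; the colors and the ratio of 1 are arbitrary values chosen for illustration:

```r
library(ggplot2)

p_border <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  theme(
    panel.background = element_rect(fill = "grey95"),
    # the border is drawn over the data, so fill = NA keeps the points visible
    panel.border = element_rect(color = "black", fill = NA),
    # height / width of the panel, not of the whole plot
    aspect.ratio = 1
  )
p_border
```

Leaving out fill = NA would draw a filled rectangle on top of the points.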
The following theme elements are associated with faceted ggplots:
| Element | Setter | Description |
|---|---|---|
| strip.background | element_rect() | background of panel strips |
| strip.text | element_text() | strip text |
| strip.text.x | element_text() | horizontal strip text |
| strip.text.y | element_text() | vertical strip text |
| panel.margin | unit() | margin between facets |
| panel.margin.x | unit() | margin between facets (vertical) |
| panel.margin.y | unit() | margin between facets (horizontal) |
The strip.text.x element affects both facet_wrap() and facet_grid(); strip.text.y only affects facet_grid().
Several packages can arrange heterogeneous plots, for example patchwork, cowplot, gridExtra, and ggpubr. This chapter focuses on the patchwork package only.
Example operators and functions of patchwork:
- +: add plots together. Note that in the absence of a layout, the same algorithm that governs the number of rows and columns in facet_wrap() will decide the number of rows and columns, i.e. it creates a 1x3 grid when adding 3 plots, and a 2x2 grid when adding 4 plots.
- plot_layout(): control the layout, e.g. the number of rows and columns.
- /: shortcut for stacking plots, e.g. p1 / p2 is equivalent to p1 + p2 + plot_layout(ncol = 1).
- |: shortcut for placing plots side by side, e.g. p1 | p2 is equivalent to p1 + p2 + plot_layout(nrow = 1).
- &: modify all (or some) subplots, e.g. p1 / (p2 | p3) & theme_minimal() changes the theme of p1, p2, and p3.
- plot_annotation(): add annotations to the assembled plot.
- inset_element(): mark a given plot as an inset and place it on top of another plot.

The examples below are based on the following 4 subplots.
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
p2 <- ggplot(mpg, aes(x = as.character(year), fill = drv)) +
geom_bar(position = "dodge") +
labs(x = "year")
p3 <- ggplot(mpg, aes(x = hwy, fill = drv)) +
geom_density(color = NA) +
facet_grid(rows = vars(drv))
p4 <- ggplot(mpg) +
stat_summary(aes(x = drv, y = hwy, fill = drv), geom = "col", fun.data = mean_se) +
stat_summary(aes(x = drv, y = hwy), geom = "errorbar", fun.data = mean_se, width = 0.4)
+
p1 + p2 + p3
plot_layout() to control the number of rows and columns
p1 + p2 + p3 + plot_layout(ncol = 2, byrow = TRUE)
/ and |
p3 | (p2 / (p1 | p4))
plot_layout(design = )
## use `#` to denote empty areas
layout <- "
aaa
c#b
cdb
"
p1 + p2 + p3 + p4 + plot_layout(design = layout)
plot_layout(guides = )
p1 + p2 + p3 + plot_layout(ncol = 2, guides = "collect")
Electing to collect guides will take all guides and put them together at the position governed by the global theme. Further, it will remove any duplicate guide, leaving only unique guides in the plot. The duplication detection looks at the appearance of the guide, not the underlying scale it comes from. Thus, it will only remove guides that are exactly alike. If you want to optimize space use by putting guides in an empty area of the layout, you can specify a plotting area for the collected guides.
p1 + p2 + p3 + guide_area() + plot_layout(ncol = 2, guides = "collect")
p1 + geom_point(aes(color = class)) + p2 + p3 + guide_area() + plot_layout(ncol = 2, guides = "collect")
&

One of the tenets of patchwork is that the plots remain as standard ggplot objects until rendered. This means that they are amenable to modification after they have been assembled. The specific plots can be retrieved and set with [[]] indexing:
p12 <- p1 + p2
p12[[2]] <- p12[[2]] + theme_light()
p12
Often though, it is necessary to modify all subplots at once. For example, the following code gives the plots a common y-axis range.
p1 + p4 & scale_y_continuous(limits = c(0, 45))
plot_annotation()

Titles, subtitles, captions, themes, etc. can be added to patchwork plots using the plot_annotation() function.
p34 <- p3 + p4 + plot_annotation(
title = "A closer look at the effect of drive train in cars",
caption = "Source: mpg dataset in ggplot2"
)
p34 + plot_annotation(theme = theme_gray(base_family = "mono"))
Using & along with a theme object will modify the global theme as well as the themes of the subplots all together.
p34 & theme_gray(base_family = "mono")
Another type of annotation, known especially in scientific literature, is to add tags to each subplot that will then be used to identify them in the text and caption. ggplot2 has the tag element for this, and patchwork offers functionality to set this automatically using the tag_levels argument.
p123 <- p1 | (p2 / p3)
p123 + plot_annotation(tag_levels = "I") ## uppercase Roman numerals
An additional feature is that it is possible to use nesting to define new tagging levels:
p123[[2]] <- p123[[2]] + plot_layout(tag_level = "new") ## auto-tagging starts a new level for p123[[2]]
p123 + plot_annotation(tag_levels = c("I", "a"))
The position of the inset plot is specified by the left, right, top, and bottom location of the inset, relative to either panel (default), plot, or full (the align_to argument). The default is to use npc units, which go from 0 to 1 in the given area, but any grid::unit() can be used by giving it explicitly.
For example, the following plot places an inset exactly 10mm from the top right corner:
p1 + inset_element(
p = p2,
left = 0.6,
bottom = 0.5,
right = unit(1, "npc") - unit(10, "mm"),
top = unit(1, "npc") - unit(10, "mm"),
align_to = "plot"
)
Insets are not confined to ggplots. Any graphics supported by wrap_elements() can be used, including patchwork plots. For example,
p24 <- p2 / p4 + plot_layout(guides = "collect")
p1 + inset_element(p24, left = 0.5, bottom = 0.5, right = 0.9, top = 0.9)
Also, insets behave like standard patchwork subplots until they are rendered. This means they are amenable to modifications after assembly, e.g. using &. Auto-tagging also works with insets.
p12 <- p1 + inset_element(p2, 0.5, 0.5, 0.9, 0.9) + plot_annotation(tag_levels = "A")
p12 & theme_bw()
The aesthetic mappings, defined in aes(), describe how variables are mapped to visual properties or aesthetics. aes(x, y, ...) takes a sequence of aesthetic-variable pairs. The first two arguments are x and y.
Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in some combination of both.
Within each layer, you can add, override, or remove mappings. For example, if you have a plot using the mpg data that has aes(displ, hwy) as the starting point, the table below illustrates all three operations:
| Operation | Layer aesthetics | Result |
|---|---|---|
| Add | aes(color = cyl) | aes(displ, hwy, color = cyl) |
| Override | aes(y = cty) | aes(displ, cty) |
| Remove | aes(y = NULL) | aes(displ) |
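The three operations can be sketched on the mpg data as follows. The choice of geoms is an assumption made for illustration; in particular, a histogram is used for the "remove" case because it needs only an x aesthetic:

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy))

# Add: the layer uses aes(displ, hwy, color = factor(cyl))
p_add <- base + geom_point(aes(color = factor(cyl)))

# Override: the layer uses aes(displ, cty)
p_override <- base + geom_point(aes(y = cty))

# Remove: the layer uses aes(displ) only
p_remove <- base + geom_histogram(aes(y = NULL), bins = 20)
```

In each case the mapping given in ggplot() stays unchanged; only the layer sees the modified mapping.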
Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters. We map an aesthetic to a variable (e.g., aes(color = cut)), or set it to a constant (e.g., color = "red"). The rules of thumb are:
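- To map an aesthetic to a variable, put it inside aes().
- To set an aesthetic to a constant, put it outside aes().

But what happens if you specify a value inside aes()?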
The second plot maps (not sets) the color to the value ‘green’. This effectively creates a new variable containing only the value ‘green’, and then scales it with a color scale. Because this variable is discrete, the default color scale uses evenly spaced colors on the color wheel, and since there is only one value, this color is pinkish.
The third approach below also maps the value, but overrides the default scale. This is useful in cases where you have a column that already contains color values.
p1 <- ggplot(mpg, aes(cty, hwy)) +
geom_point(color = "green")
p2 <- ggplot(mpg, aes(cty, hwy)) +
geom_point(aes(color = "green")) ## `scale_color_discrete()` is used by default
p3 <- ggplot(mpg, aes(cty, hwy)) +
geom_point(aes(color = "green")) + ## variable color is created
scale_color_identity()
p1 + p2 + p3
It is sometimes useful to map aesthetics to constants. For example, if you want to display multiple layers with varying parameters, you can “name” each layer.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(aes(color = "loess"), method = "loess", se = FALSE) +
geom_smooth(aes(color = "lm"), method = "lm", se = FALSE) +
labs(color = "Method")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
labs()

When customizing a plot, it is often useful to modify the titles, axes, legends, etc. The labs() helper function provides a shorthand way to specify the name argument (see scale constructor functions such as continuous_scale() for example) to one or more scales. It allows you to supply name-value pairs, where

- the name is title, subtitle, caption, tag, or an aesthetic, such as x, color, fill;
- the value is the label text, with \n used to specify line breaks.

You can supply mathematical expressions wrapped in quote(). The rules by which these expressions are interpreted can be found by typing ?plotmath.
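A small sketch of both kinds of label value, using a made-up data frame; the expressions inside quote() are arbitrary plotmath examples:

```r
library(ggplot2)

df <- data.frame(x = 1:10, y = (1:10)^2)

p_labs <- ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    title = "A two-line title\nbroken with a newline",  # "\n" starts a new line
    x = quote(x[i]),     # plotmath: x with subscript i
    y = quote(x[i]^2)    # plotmath: x with subscript i, squared
  )
p_labs
```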
It is also possible to include (some) markdown in axis and legend titles with the help of the ggtext package and the ggplot2 theme system. To enable markdown you need to set the relevant theme element to ggtext::element_markdown(), as demonstrated below:
df <- data.frame(x = 1:4, y = 1:4)
p1 <- ggplot(df, aes(x, y)) +
geom_point() +
labs(x = "Axis label with *italics* and **boldface**")
p2 <- p1 + theme(axis.title.x = ggtext::element_markdown())
p1 + p2
Setting labs(x = "") omits the label but still allocates space. Setting labs(x = NULL) removes the label and its space.
guides()

In ggplot2, legends and axes are known collectively as guides.
The guides() helper function works in a similar way to labs() (see Section 7.3.1). Both take the name of different aesthetics (e.g. color, x, fill) as arguments and allow you to specify your own value. Where labs() provides a shorthand way to specify the name argument, guides() allows you to specify the guide argument to one or more scales.
However, scale guides are more complex than scale names: where the name argument (and labs()) takes text as input, the guide argument (and guides()) takes a guide object created by a guide function, such as guide_colorbar() and guide_legend(). The arguments to these functions offer additional fine control over the guide.
The table below summarizes the default guide functions associated with different scale types:
| Scale type | Default guide type |
|---|---|
| continuous scales for color/fill aesthetics | colorbar |
| binned scales for color/fill aesthetics | colorsteps |
| position scales (continuous, binned and discrete) | axis |
| discrete scales (except position scales) | legend |
| binned scales (except position/color/fill scales) | bins |
The guide functions have numerous examples in the documentation that illustrate all of their arguments. Many of the arguments to the guide function are equivalent to theme settings like text color, size, font, etc, but only apply to a single guide.
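As a hedged sketch of how guides() pairs aesthetics with guide objects, the example below adjusts the colorbar of a continuous color scale and the legend of a discrete shape scale. The titles and the choice of arguments are illustrative only:

```r
library(ggplot2)

p_guides <- ggplot(mpg, aes(displ, hwy, color = cty, shape = drv)) +
  geom_point() +
  guides(
    # continuous color scale: adjust the default colorbar
    color = guide_colorbar(title = "City mpg", reverse = TRUE),
    # discrete shape scale: lay the legend keys out in a single row
    shape = guide_legend(title = "Drive train", nrow = 1)
  )
p_guides
```

The same guide objects could instead be passed to the guide argument of the corresponding scale function.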
xlim(), ylim(), lims()

Examples using xlim() or ylim():

- xlim(10, 20): a continuous scale from 10 to 20
- xlim(20, 10): a reversed continuous scale from 20 to 10
- xlim("a", "b", "c"): a discrete scale
- xlim(as.Date(c("2008-05-01", "2008-08-01"))): a date scale from May 1 to August 1 2008

The lims() function takes name-value pairs as input, where the name specifies the aesthetic and the value specifies the limits. For example:
p + lims(x = c(1, 7), y = c(10, 45), color = c("c", "d", "e", "p", "r"))
When saving a plot to use in another program, you have two basic choices of output: raster or vector.
Unless there is a compelling reason not to, use vector graphics: they look better in more places.
There are two main reasons to use raster graphics:
There are two ways to save output from ggplot2. You can use the standard R approach where you open a graphics device, generate the plot, then close the device:
pdf("output.pdf", width = 6, height = 6)
ggplot(mpg, aes(displ, hwy)) + geom_point()
dev.off()
This works for all packages, but is verbose. ggplot2 provides a convenient shorthand with ggsave():
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("output.pdf")
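Rather than relying on the last plot displayed, ggsave() can also be given the plot and the output size explicitly. A minimal sketch, writing to a temporary directory (the file name and dimensions are arbitrary):

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Write to a temporary directory so the sketch does not clutter the project
out_file <- file.path(tempdir(), "output.pdf")

# plot, width and height (in inches) are given explicitly; the ".pdf"
# extension selects the PDF device
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

Passing plot explicitly makes the call safe to use in scripts, where no plot may have been displayed yet.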
ggsave() is optimized for interactive use: you can use it after you’ve drawn a plot. It has the following important arguments:
- filename: file name to create on disk. The file extension will be used to automatically select the correct graphics device. ggsave() can produce .eps, .pdf, .svg, .wmf, .png, .jpg, .bmp, and .tiff.
- plot: plot to save, defaults to the last plot displayed.
- path: specifies the path where the image should be saved.
- width and height: control the output size, specified in inches. If left blank, they’ll use the size of the on-screen graphics device.
- dpi: for raster graphics (i.e., .png, .jpg), the dpi argument controls the resolution of the plot. It defaults to 300, which is appropriate for most printers, but you may want to use 600 for particularly high-resolution output, or 96 for on-screen display (e.g., web).

Conceptually, an annotation supplies metadata for the plot: that is, it provides additional information about the data being displayed. From a practical standpoint, however, metadata is just another form of data. Because of this, the annotation tools in ggplot2 use the same geoms that are used to create other plots.
In addition, there are some helper functions in ggplot2 itself, and a number of other packages you can use.
Adding text to a plot is one of the most common forms of annotation. However, text annotation can be tricky due to the way that R handles fonts. The ggplot2 package doesn’t have all the answers, but it does provide some tools to make your life a little easier.
geom_text()

The main tool for labeling plots is geom_text(), which adds label text at the specified x and y positions.
Some important arguments:
- mapping: set of aesthetic mappings created by aes() or aes_(). geom_text() has the most aesthetics of any geom, because there are so many ways to control the appearance of text. Some important aesthetics are:
  - x, y, label: required aesthetics.
  - family: provides the name of a font. This aesthetic does allow you to use the name of a system font, but there are only three fonts that are guaranteed to work everywhere: “sans” (default), “serif”, or “mono”.
  - fontface: “plain” (default), “bold”, or “italic”.
  - hjust, vjust: adjust the alignment of text with hjust (“left”, “center”, “right”, “inward”, “outward”) and vjust (“bottom”, “middle”, “top”, “inward”, “outward”). By default the alignment is centered. One of the most useful alignments is “inward”, which aligns text towards the middle of the plot, ensuring that labels remain within the plot limits.
  - size: controls font size. Unlike most tools, ggplot2 specifies the size in mm, rather than the usual points (pts). The reason for this choice is that it makes the units for font size consistent with how other sizes are specified in ggplot2. (There are 72.27 pts and 25.4 mm in an inch, so to convert from pts to mm, multiply by 25.4 / 72.27.)
  - angle: rotation of the text in degrees.
- nudge_x, nudge_y: numeric to nudge by horizontally or vertically, to offset text a little from points to avoid overlapping with them.
- check_overlap: If check_overlap = TRUE, overlapping labels will be automatically removed from the plot. The algorithm is simple: labels are plotted in the order they appear in the data frame; if a label would overlap with a label that has already been drawn, it’s omitted.
base <- ggplot(mpg, aes(displ, hwy)) +
xlim(1, 8)
p1 <- base + geom_text(aes(label = model)) ## model is a variable in `mpg`
p2 <- base + geom_text(aes(label = model), check_overlap = TRUE)
p1 + p2
This simple behavior is occasionally useful. For example, if you sort the input data in order of priority, the result is a plot with labels that emphasize important data points.
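A minimal sketch of that idea, treating highway mileage as the (assumed) measure of priority so that the most fuel-efficient cars are labeled first:

```r
library(ggplot2)

# Sort so that the rows with the highest hwy come first; their labels are
# drawn first and therefore survive the overlap check
mpg_sorted <- mpg[order(mpg$hwy, decreasing = TRUE), ]

p_priority <- ggplot(mpg_sorted, aes(displ, hwy)) +
  xlim(1, 8) +
  geom_text(aes(label = model), check_overlap = TRUE)
p_priority
```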
geom_label()

geom_label() is a variation of geom_text(). It draws a rounded rectangle behind the text. This makes it useful for adding labels to plots with busy backgrounds.
label <- data.frame(waiting = c(55, 80), eruptions = c(2, 4.3), label = c("Peak one", "Peak two"))
ggplot(faithfuld, aes(waiting, eruptions)) +
geom_tile(aes(fill = density)) +
geom_label(data = label, aes(label = label))
Labeling data well poses some challenges:
- Text does not affect the limits of the plot. Unfortunately there’s no way to make this work since a label has an absolute size (e.g. 3cm), regardless of the size of the plot. This means that the limits of a plot would need to be different depending on the size of the plot, and there’s just no way to make that happen with ggplot2. Instead, you’ll need to tweak xlim() and ylim() based on your data and plot size.
- If you want to label many points, it is difficult to avoid overlaps. check_overlap = TRUE is useful, but offers little control over which labels are removed. A popular technique for addressing this is to use the ggrepel package. The package supplies geom_text_repel(), which optimizes the label positioning to avoid overlap. It works quite well so long as the number of labels is not excessive.
- It can sometimes be difficult to ensure that text labels fit within the space that you want. The ggfittext package contains useful tools that can assist with this, including functions that allow you to place text labels inside the columns in a bar chart.
The family aesthetic of geom_text() allows you to specify the name of a system font, but some care is required. There are only three fonts that are guaranteed to work everywhere: “sans” (default), “serif”, or “mono”. To illustrate these:
df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = family, family = family)) +
scale_x_continuous(breaks = NULL)
The reason that it can be tricky to use system fonts in a plot is that text drawing is handled differently by each graphics device (GD). There are two groups of GDs:
- Screen devices such as windows() for Windows, quartz() for Macs, x11() mostly for Linux, and RStudioGD() within RStudio, which draw the plot to the screen.
- File devices such as png() and pdf(), which write the plot to a file.

Unfortunately the devices do not specify fonts in the same way. So if you want a font to work everywhere, you need to configure the devices in different ways. Two packages simplify the quandary a bit:
- showtext makes GD-independent plots by rendering all text as polygons.
- extrafont converts fonts to a standard format that all devices can use.

Both approaches have pros and cons, so you will need to try both of them and see which works best for your needs.
Labeling individual points with text is an important kind of annotation, but it is not the only useful technique. The ggplot2 package provides several other tools to annotate plots using the same geoms you would use to display data.
For example, you can use:
geom_text(), geom_label(), as shown in Section 7.5.1
geom_rect(): highlight rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin, and ymax.
geom_line(), geom_path(), geom_segment(): add lines. All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. Create arrowheads with arrow(), which has arguments angle, length, ends and type.
geom_vline(), geom_hline(), geom_abline(): allow you to add reference lines that span the full range of the plot.
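Two of the tools in this list can be sketched on the built-in economics data: a horizontal reference line and a segment with an arrowhead. The arrow coordinates in arrow_df are arbitrary values chosen for illustration, not meaningful dates:

```r
library(ggplot2)

# Hypothetical start and end points for a single arrow
arrow_df <- data.frame(
  x = as.Date("1990-01-01"), y = 14000,
  xend = as.Date("1983-06-01"), yend = 12000
)

p_ref <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  # Reference line at the mean, spanning the full width of the plot
  geom_hline(
    yintercept = mean(economics$unemploy),
    linetype = "dashed", color = "grey40"
  ) +
  # One-row data frame, so the arrow is drawn once
  geom_segment(
    aes(x = x, y = y, xend = xend, yend = yend),
    data = arrow_df,
    arrow = arrow(length = unit(0.2, "cm"))
  )
p_ref
```

Supplying a one-row data frame to geom_segment() avoids drawing the same arrow once for every row of economics.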
To illustrate how ggplot2 tools can be used to annotate plots we’ll start with a time series plotting US unemployment over time:
ggplot(economics, aes(date, unemploy)) +
geom_line()
One useful way to annotate this plot is to use shading to indicate which president was in power at the time. To do this, we use geom_rect() to introduce shading, geom_vline() to introduce separators, geom_text() to add text labels, and geom_line() to overlay the data on top of these background elements:
presidential <- subset(presidential, start > economics$date[1])
ggplot(economics) +
geom_rect(aes(xmin = start, xmax = end, fill = party), ymin = -Inf, ymax = Inf, alpha = .2, data = presidential) +
geom_vline(aes(xintercept = start), alpha = .2, data = presidential) +
geom_text(aes(x = start, y = 2500, label = name), vjust = 0, hjust = 0, nudge_x = 50, data = presidential) +
geom_line(aes(date, unemploy)) +
scale_fill_manual(values = c("red", "blue")) +
labs(x = "date", y = "Unemployment")
Note the use of -Inf and Inf as positions, which refer to the bottom and top (or left and right) limits of the plot.
annotate() helper function
Using existing geoms to build custom annotations can be applied in other ways. For instance, you can use it to add a single annotation to a plot, but it’s a bit fiddly because you have to create a one-row data frame:
yrng <- range(economics$unemploy)
xrng <- range(economics$date)
caption <- paste(strwrap("Unemployment rates in the US have varied a lot over the years", 40), collapse = "\n")
ggplot(economics, aes(date, unemploy)) +
geom_line() +
geom_text(aes(x, y, label = caption),
data = data.frame(x = xrng[1], y = yrng[2], caption = caption),
hjust = 0, vjust = 1, size = 4, color = "red"
)
This code works, but it is very cumbersome. So ggplot2 introduces the annotate() helper function which creates the data frame for you:
ggplot(economics, aes(date, unemploy)) +
geom_line() +
annotate(geom = "text", x = xrng[1], y = yrng[2], label = caption,
hjust = 0, vjust = 1, size = 4, color = "red")
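annotate() also accepts -Inf and Inf, so a single shaded band can be added without building a data frame. In this sketch the date window is an arbitrary choice for illustration:

```r
library(ggplot2)

p_band <- ggplot(economics, aes(date, unemploy)) +
  # One rectangle spanning the full height of the panel;
  # the start and end dates are hypothetical
  annotate(
    "rect",
    xmin = as.Date("2000-01-01"), xmax = as.Date("2005-01-01"),
    ymin = -Inf, ymax = Inf,
    alpha = 0.2, fill = "steelblue"
  ) +
  geom_line()
p_band
```

The rectangle is added before geom_line() so that the data are drawn on top of the shading.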
The convenience of annotate() comes in handy in other situations. For example, a common form of annotation is to highlight a subset of points by drawing larger points in a different color underneath the main data set. To highlight vehicles manufactured by Subaru, you could do this:
p <- ggplot(mpg, aes(displ, hwy)) +
geom_point(data = dplyr::filter(mpg, manufacturer == "subaru"), color = "orange", size = 3) +
geom_point()
The problem with this is that the highlighted category would not be labeled. This is easily rectified with the annotate() helper function.
p +
annotate("point", x = 5.5, y = 40, color = "orange", size = 3) +
annotate("point", x = 5.5, y = 40) +
annotate("text", x = 5.6, y = 40, label = "Subaru", hjust = 0)
This approach may cause confusion with real data. An alternative is to use a different geom to do the work. geom_curve() and geom_segment() can be used to draw curves and lines connecting points with labels, and can be used in conjunction with annotate().
p +
annotate(geom = "curve", x = 4, y = 35, xend = 2.65, yend = 27, curvature = .3, arrow = arrow(length = unit(.2, "cm"))) +
annotate(geom = "text", x = 4, y = 35, label = "Subaru", hjust = 0)
The Subaru plots above provide examples of “direct labeling”, in which the plot region itself contains the labels for groups of points, instead of using a legend. This usually makes the plot easier to read because it puts the labels closer to the data.
The broader ggplot2 ecosystem contains a variety of other tools to accomplish this in a more automated fashion.
directlabels package
directlabels provides a number of position methods. smart.grid is a reasonable place to start for scatter plots, but there are other methods that are more useful for frequency polygons and line plots.
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point(show.legend = FALSE) +
directlabels::geom_dl(aes(label = class), method = "smart.grid")
ggforce package
The ggforce package contains a lot of useful tools to extend ggplot2 functionality, including functions such as geom_mark_ellipse() that overlay a plot with circular “highlight” marks. For example:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
ggforce::geom_mark_ellipse(aes(label = cyl, group = cyl))
gghighlight package
gghighlight is another package that is useful for highlighting points or lines (or indeed a variety of different geoms) within a plot, particularly for longitudinal data.
data(Oxboys, package = "nlme")
ggplot(Oxboys, aes(age, height, group = Subject)) +
geom_line() +
geom_point() +
gghighlight::gghighlight(Subject %in% 1:3)
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: Subject
mod_coef <- coef(lm(log10(price) ~ log10(carat), data = diamonds))
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d() +
geom_abline(intercept = mod_coef[1], slope = mod_coef[2], color = "white", size = 1) +
facet_wrap(vars(cut), nrow = 1)
Another example is when you want each facet of a plot to display data from a single group, with the complete data set plotted unobtrusively in each panel to aid visual comparison. The gghighlight package is particularly useful in this context. (See also Section 5.5 for another way of doing this).
ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point() +
gghighlight::gghighlight() +
facet_wrap(vars(cyl))
ggplot_build(): takes a graphic object as argument, returns a list of data frames (one for each layer), and a panel object, which contains all information about axis limits, breaks, etc. There are some helper functions that return the data, grob, or scales associated with a given layer directly.
layer_data(plot, i = 1L): returns the data frame for layer i.
layer_scales(plot, i = 1L, j = 1L): i is the row and j is the column of the facet to return scales for.
layer_grob(plot, i = 1L):
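A short sketch of these helpers on a simple scatterplot; the object names are illustrative only:

```r
library(ggplot2)

p_insp <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Build the plot without drawing it
built <- ggplot_build(p_insp)

# Computed data for the first (and only) layer: one row per point,
# with aesthetics already mapped to columns such as x and y
pts <- layer_data(p_insp, i = 1L)
head(pts)

# Scales used by the panel in facet row 1, column 1
scales_11 <- layer_scales(p_insp, i = 1L, j = 1L)
names(scales_11)
```

Inspecting pts is a convenient way to check what a stat or position adjustment has done to the data before it is drawn.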